




Vol. XXXV MARCH, 1944 No. 3 


The Journal of Educational 
Psychology 


Devoted Primarily to the Scientific Study of Problems of Learning and Teaching 





CONTENTS 


A Comparative Study of Different Forms of Spelling Tests . . . . 129 
DAVID S. BRODY 

The Concepts of the Profile, Psychograph, and Evalograph . . . . 145 
MORSE P. MANSON 

Confusion in Educational Measurement . . . . 157 
W. A. SAUCIER 

The Interpretation of Frequency Ratings Obtained from “The 
[remainder of title illegible] . . . . 169 
FREDERICK B. DAVIS 

Reliability of Multiple-Choice Tests as a Function of Number of 
[remainder of title illegible] . . . . 175 
FREDERIC M. LORD 

Techniques Used in Analyzing the Learning Achievement of Naval 
[remainder of title illegible] . . . . 181 
ELLIS WEITZMAN AND WALTER J. MCNAMARA 


$6.00 per Year - Published Monthly September to May 


WARWICK & YORK, INC. 


BALTIMORE, MD. 


Entered as Second Class Matter Nov. 15, 1921, at the Post Office at Baltimore, Md. 
under the Act of March 3, 1879; additional entry as Second Class Matter at York, Pa. 












THE JOURNAL OF 
Educational Psychology 


Established 1910 


EDITED BY 


JACK W. DUNLAP 
University of Rochester 


IN ASSOCIATION WITH 





STEPHEN M. COREY              H. H. REMMERS 
University of Chicago         Purdue University 
JOHN G. DARLEY                PERCIVAL M. SYMONDS 
University of Minnesota       Teachers College (Columbia University) 
BERTHA PETERSON HARPER        PAUL A. WITTY 
University of Rochester       Northwestern University 
HAROLD E. JONES               H. E. BUCHHOLZ 
University of California      Managing Editor 


THE JOURNAL OF EDUCATIONAL PSYCHOLOGY is devoted primarily 
to the scientific study of problems of learning, teaching, 
and measurement of the psychological development of the individual. 
THE JOURNAL will contain articles on the following subjects: 
the psychology of school subjects; experimental studies of 
learning; the development of interests, attitudes, and personality, 
particularly as related to school adjustment; emotion, motivation, 
and character; mental development and methods. This last will 
include tests, statistical techniques, and research techniques in 
cross-sectional and developmental studies. 

Dr. Jack W. Dunlap is now on leave of absence from the 
University of Rochester and serving as Lieut. Commander in the 
U. S. Navy. For the ‘duration’, therefore, manuscripts, books 
and other materials for review, and correspondence regarding 
editorial matters should be addressed to H. E. Buchholz, Warwick 
& York, 10 E. Centre St., Baltimore, Md. 

Manuscripts should be typed and double-spaced throughout 
including quotations, footnotes, and references. Return postage 
should be included with all unsolicited manuscripts. 

THE JOURNAL is published monthly from September to May. 
The price per year in the United States and Pan-American coun- 
tries is $6.00; $6.20 to Canada; and $6.40 to other foreign countries. 
Part-year subscriptions are 90 cents per issue ordered. Back vol- 
umes are $7.00 each, and back issues $1.10. 

Subscribers should notify the Publishers of change of address 
at least four weeks in advance of publication of issue with which 
the change is to take effect. Claims for non-receipt of an issue will 
not be honored unless made within two weeks after receipt of next 
succeeding number. 


WARWICK AND YORK Publishers BALTIMORE, MD. 











THE JOURNAL OF 
EDUCATIONAL PSYCHOLOGY 








Volume XXXV March, 1944 Number 3 








A COMPARATIVE STUDY OF DIFFERENT FORMS 
OF SPELLING TESTS* 


DAVID S. BRODY 


The method most frequently employed for measuring the 
spelling ability of school children has been the list test in which 
the teacher dictates one word at a time to be written by the pupil. 
It has been assumed that the ability to spell a word as measured 
by such a test is an accurate index of the ability to spell under 
widely different circumstances such as letter-writing, taking 
notes, writing reports, and detecting spelling errors in proof- 
reading. A preliminary survey in the Minneapolis schools 
indicated that the difficulty of spelling words as determined by a 
list test is not necessarily a true indication of the difficulty of 
words in a functional situation. It appeared that words in a 
contextual setting were more difficult to spell than when presented 
in list form. 

Northby⁵ and Moore⁴ in their analyses of different forms of 
spelling tests administered to sixth-grade pupils also found 
that the difficulty of spelling words is in part a function of the 
type of test employed. Their findings indicate that a multiple 
choice recognition test tends to be the least difficult, whereas 
the timed dictation and story form tests are the most difficult. 
Both studies are suggestive of a number of problems with respect 
to further research. 

In the present study, several different forms of spelling tests 
were subjected to a detailed analysis for purposes of determining: 
(1) the comparative difficulty of various types of spelling tests 
for pupils at different grade levels, (2) the effect of variations of 
paragraph context upon the difficulty of spelling words in both 
‘recognition’ and ‘recall’ tests, and (3) the extent to which 
different types of spelling tests are measures of the same or 
different abilities. 

* The research on which this article is based was conducted as a WPA 
official project No. 465-71-3-279, Work Project No. 6088, under the 
sponsorship of the Board of Education of the City of Minneapolis. 

The subjects in this study consisted of approximately twelve 
hundred pupils selected from grades IV through IX in the 
Minneapolis schools. Care was taken to select pupils from 
many different sections of the city in order to obtain a sample 
which was representative of the school population as a whole. 
The number of pupils in each grade for whom test records were 
secured is as follows: 

Grade     Number of Pupils 
IV        179 
V         176 
VI        174 
VII       223 
VIII      244 
IX        235 


Sixty words presenting a wide range in spelling difficulty were 
selected from the Iowa¹ and the Buckingham-Ayres² spelling 
scales. These words were classified into six lists of ten words 
each on the following basis: List IV—words of average difficulty 
for fourth-grade pupils, List V—words of average difficulty for 
fifth-grade pupils, List VI-1 and VI-2—words of average difficulty 
for sixth-grade pupils, List VII—words of average difficulty for 
seventh-grade pupils, and List VIII—words of average difficulty 
for eighth-grade pupils. With one or two exceptions all words 
were selected from the thirty-fifth to the sixty-fifth per cent 
levels of difficulty for pupils in each of the respective grades. 
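
The selection rule just described can be sketched in a few lines. This is only an illustration: the words and percentages below are hypothetical and are not taken from the Iowa or the Buckingham-Ayres scale, and the function name is an assumption. Because the band from thirty-five to sixty-five per cent is symmetric about fifty, the same words are kept whether the scale values are read as per cent correct or per cent in error.

```python
# Hypothetical scale entries for one grade: word -> per cent of pupils
# spelling it correctly. These figures are invented for illustration only.
scale_grade_iv = {
    "laid": 50,
    "break": 62,
    "cat": 98,
    "accommodate": 2,
    "piece": 41,
}

def words_of_average_difficulty(scale, low=35, high=65):
    """Return, in alphabetical order, the words whose per cent value lies in [low, high]."""
    return sorted(word for word, pct in scale.items() if low <= pct <= high)

selected = words_of_average_difficulty(scale_grade_iv)
print(selected)
```

Words that nearly every pupil spells correctly, or that almost none do, fall outside the band and are excluded.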

As a preliminary step a number of children in each grade from 
IV through VIII were asked to write a paragraph containing 
the ten words which were of average difficulty for their respective 
grade. The best paragraph written at each grade level was 
selected for use in the present study. These paragraphs were 
considered to be of average contextual difficulty and were used 
as guides in constructing ‘easy context’ and ‘difficult context’ 
paragraphs. In preparing the ‘easy context’ paragraphs, the 
writer employed a vocabulary which he judged as simpler than 
that used by the pupil who had composed the ‘average context’ 
paragraph. In a similar manner, ‘difficult context’ paragraphs 
were prepared by employing more difficult vocabulary. The 
only common factor in the three paragraphs was the basic list 
of ten words. 

A total of eighteen paragraphs were prepared, three for each 
of the word lists. These paragraphs were administered three 
at a time every other day for a period of two weeks. In administering 
these test forms, the teachers read the paragraphs to the 
pupils who were instructed to write exactly what was dictated. 
The copies of the paragraphs that were presented to the teachers 
gave no indications as to which were the key words. The order 
of presenting the paragraphs was so arranged that the words 
in any single list were not presented successively, but were 
separated by an interval of at least six testing units. 
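
One ordering that satisfies the spacing requirement can be sketched as follows. The article does not report the actual order used, so the rotation below is an assumption, as is the reading of a "testing unit" as one paragraph; the function names are likewise hypothetical.

```python
LISTS = ["IV", "V", "VI-1", "VI-2", "VII", "VIII"]
CONTEXTS = ["easy", "average", "difficult"]

def build_order(lists, contexts):
    """Cycle through all word lists once per round, rotating the context
    so that each list meets each context exactly once."""
    order = []
    for round_no in range(len(contexts)):
        for i, word_list in enumerate(lists):
            context = contexts[(round_no + i) % len(contexts)]
            order.append((word_list, context))
    return order

def min_gap_same_list(order):
    """Smallest distance, in paragraphs, between two paragraphs built on the same list."""
    last_seen = {}
    gaps = []
    for pos, (word_list, _context) in enumerate(order):
        if word_list in last_seen:
            gaps.append(pos - last_seen[word_list])
        last_seen[word_list] = pos
    return min(gaps)

order = build_order(LISTS, CONTEXTS)
# Three paragraphs per sitting gives six sittings in all.
sessions = [order[i:i + 3] for i in range(0, len(order), 3)]
for day, session in enumerate(sessions, start=1):
    print(day, session)
print("minimum gap between paragraphs of the same list:", min_gap_same_list(order))
```

Cycling through all six lists before returning to the first guarantees that two paragraphs built on the same ten words are never closer than six paragraphs apart.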

The three paragraphs prepared for each of the six word lists 
were also presented in mimeographed form to the pupils. The 
ten key words in each paragraph were misspelled, and the pupils 
were told to underline all incorrectly spelled words. No indica- 
tion as to the number of misspelled words in each paragraph was 
given. The misspellings employed were those which Gates³ 
found to be the most frequent among school children. The 
eighteen proof-reading paragraphs were presented three at a 
time every other day during the third and fourth weeks of the 
testing period. 

For the list test, the sixty key words were classified into two sets of thirty 
words each, Set I containing the words in Lists IV, V, and VI-1, 
and Set II containing the words in Lists VI-2, VII, and VIII. 
These sets were administered one at a time on different days 
during the fifth week. 

A series of sixty sentences each of which contained one of the 
key words was also prepared. As in the case of the list test, the 
sentences were classified into two sets of thirty sentences each. 
These sets were administered one at a time on different days 
during the sixth week. 


RESULTS 


For each child scores were obtained consisting of the number 
of words spelled correctly or words recognized as incorrectly 
spelled out of the total list of sixty words in the eight different 
situations. In Table I the mean and standard deviation of the 
scores on the list, sentence, dictated paragraph, and proof-reading 
tests are presented for each grade level. It will be noted that 
the mean scores on the list and sentence tests are extremely 
close and that the differences which do exist are very small. 
However, these differences are consistent and show that the 
sentence test is slightly more difficult at each grade level. Some 
slight differences in variability of scores are also noticeable. 
With the exception of the sixth grade, the standard deviations of 
scores on the sentence test are somewhat higher than they are 
on the list test. 

TABLE I. MEAN SCORES AND STANDARD DEVIATIONS ON THE DICTATED PARAGRAPHS AND PROOF-READING TESTS 
[Rows: List; Sentence; Easy, Average, and Difficult Paragraph; Easy, Average, and Difficult Proof-Reading. Columns: Mean and SD for each of Grades IV through IX. The tabular values are illegible in this copy.] 

Fig. 1. Mean scores on the list and sentence tests by grade level. 
[Figure: test scores plotted against grade level for the list test and the sentence test.] 

Developmental trends and relative differences in the difficulty 
of the list and sentence tests are graphically illustrated in Figure 1. 
In general, the list and sentence tests approximate each other 
so closely with respect to difficulty and dispersion of scores that 
for practical purposes they may be considered identical. 
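
The two summary figures compared here, the mean and the standard deviation, might be obtained as in the sketch below. The scores are hypothetical, and the use of the population form of the standard deviation (division by N) is an assumption, since the article does not state which form was used.

```python
from statistics import mean, pstdev

# Hypothetical scores out of sixty for five pupils on two test forms.
list_scores = [12, 20, 25, 31, 17]
sentence_scores = [11, 19, 26, 29, 15]

def summarize(scores):
    """Return (mean, standard deviation) rounded to two decimals.
    pstdev divides by N; statistics.stdev would divide by N - 1."""
    return round(mean(scores), 2), round(pstdev(scores), 2)

list_summary = summarize(list_scores)
sentence_summary = summarize(sentence_scores)
print("list:", list_summary)
print("sentence:", sentence_summary)
```

In these invented data the sentence form shows a slightly lower mean and a slightly larger dispersion, which is the pattern the text describes.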








TABLE II. DIFFERENCES IN MEAN SCORES BETWEEN EACH OF THE PARAGRAPH AND PROOF-READING TESTS AND THE CORRESPONDING CRITICAL RATIOS 
[Rows: (1) Easy Paragraph vs. (2) Average Paragraph; (1) Easy Paragraph vs. (2) Difficult Paragraph; (1) Average Paragraph vs. (2) Difficult Paragraph; (1) Easy Proof-Reading vs. (2) Average Proof-Reading; (1) Easy Proof-Reading vs. (2) Difficult Proof-Reading; (1) Average Proof-Reading vs. (2) Difficult Proof-Reading. Columns: M1 - M2* and C.R.** for each of Grades IV through IX. The tabular values are illegible in this copy.] 
* Negative values in the table indicate that Mean 2 has a higher score value than Mean 1. 
** The formula for the standard error of the difference between means of correlated variables was used. 



In Figure 2, the mean scores on the dictated paragraph and the 
proof-reading tests are presented graphically in comparison with 
those obtained on the list test. It will be noted that the mean 
scores on the dictated paragraphs are somewhat higher for the 
difficult context paragraphs than for paragraphs of lesser contextual 
difficulty. Although these differences are not large, the 
tendency for mean scores to be higher for paragraphs of greater 
contextual difficulty is clearly shown in each grade. That most 
of these differences are statistically significant is shown by the 
critical ratios presented in Table II. 

Fig. 2. Mean scores on the list and paragraph tests by grade level. 
[Figure: test scores plotted against grade level for the list test, the easy, average, and difficult dictated paragraphs, and the easy, average, and difficult proof-reading paragraphs.] 
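
A critical ratio of this kind is the difference between two means divided by the standard error of that difference. The sketch below uses the usual textbook formula for means of correlated variables, in which the standard error is reduced by the correlation between the two sets of scores. It is offered as an assumption about the computation; the summary figures are hypothetical and the function name is not from the article.

```python
from math import sqrt

def critical_ratio(mean1, mean2, sd1, sd2, r, n):
    """CR = (M1 - M2) / SE_diff for two tests taken by the same n pupils.
    SE_diff = sqrt(SE1^2 + SE2^2 - 2 * r * SE1 * SE2), with SE = SD / sqrt(n)."""
    se1 = sd1 / sqrt(n)
    se2 = sd2 / sqrt(n)
    se_diff = sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    return (mean1 - mean2) / se_diff

# Hypothetical summary figures, not values from Table II.
cr = critical_ratio(mean1=20.0, mean2=21.0, sd1=10.0, sd2=10.0, r=0.92, n=100)
cr_uncorrelated = critical_ratio(mean1=20.0, mean2=21.0, sd1=10.0, sd2=10.0, r=0.0, n=100)
print(round(cr, 2), round(cr_uncorrelated, 2))
```

A negative ratio means that the second mean is the higher of the two. Because the tests correlate highly, even a difference of one point yields a much larger ratio than it would for independent groups.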

The tendency for an increase in mean score to accompany an 
increase in difficulty of paragraph context does not hold true for 
the proof-reading tests. The results, instead, show a direct 
relationship between the difficulty of the test as determined by 
mean score and the difficulty of the paragraph context. Thus, in 
grade IV, the mean score decreases from 25.94 on the easy paragraphs 
to 17.61 on the difficult paragraphs. Likewise, in grade 
V, there is a decrease in mean score from 29.94 on the easy para- 
graphs to 22.70 on the difficult paragraphs. This tendency, 
which is consistent for each of the remaining grades, indicates 
that the ability to detect misspelled words is definitely affected 
by the difficulty of the context in which the words are found. 

A further difference in the results on the two types of tests 
is reflected in the extent of absolute differences in mean scores 
among the easy, average, and difficult context paragraphs. 
These differences together with the corresponding critical ratios 
are presented in Table II. In comparing the easy and average 
dictated paragraphs, it is found that the difference in mean score 
is .69 points in grade IV, .92 points in grade V, 2.58 points in 
grade VI, 2.12 points in grade VII, 1.35 points in grade VIII, 
and 1.30 points in grade IX. It will be noted that this difference 
is at a minimum in grade IV and that it reaches its maximum 
value in grade VI. Beyond grade VI this difference decreases in 
a systematic fashion with increasing grade levels. The same 
trend is likewise operative for the differences between the diffi- 
cult and easy paragraphs and between the average and difficult 
dictated paragraphs. The maximum difference is reached in 
grade VI, and differences below and above this level decrease in 
a systematic order. The extent of the differences among mean 
scores on the easy, average, and difficult dictated paragraphs 
evidently is a function of the composite difficulty of the key 
words. If the paragraphs contain key words, which are of 
average difficulty for pupils taking the test, the differences 
attributable to paragraph context tend to reach a maximum. 
If, on the other hand, the paragraphs contain a preponderance 
of key words that are either too difficult or too easy, the effect 
of contextual differences tends to be reduced. 
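
The trend described for the easy and average dictated paragraphs can be checked directly from the six differences quoted in the text. The sketch below only restates those figures; the variable names are illustrative.

```python
# Easy-versus-average dictated paragraph differences in mean score,
# by grade, as quoted in the text.
differences = {"IV": 0.69, "V": 0.92, "VI": 2.58, "VII": 2.12, "VIII": 1.35, "IX": 1.30}

grades = list(differences)
values = [differences[g] for g in grades]

peak_grade = max(differences, key=differences.get)
peak_index = grades.index(peak_grade)

# Differences should rise up to the peak grade and fall after it.
rises_to_peak = all(a < b for a, b in zip(values[:peak_index], values[1:peak_index + 1]))
falls_after_peak = all(a > b for a, b in zip(values[peak_index:], values[peak_index + 1:]))

print(peak_grade, rises_to_peak, falls_after_peak)
```

The peak falls in grade VI, the grade for which twenty of the sixty key words were chosen to be of average difficulty, which is consistent with the explanation offered above.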

A marked contrast to this trend is evident in an analysis of 
absolute differences in mean scores among the proof-reading 
paragraphs. It will be noted that for the most part these 
differences remain fairly constant in all grades except the ninth. 
Furthermore, these differences show a much higher degree of 
statistical significance than that found for the dictated para- 
graphs. There can be no question but that the difficulty of a 
proof-reading (or ‘recognition’) spelling test is much more 
affected by the contextual setting of the paragraph than is the 
dictated paragraph (or ‘recall’) spelling test. 

In Figure 2, it will also be noted that the dictated paragraph 
and the proof-reading tests change markedly with respect to 
their relative difficulty from one grade to the next. In grades 
IV and V the dictated paragraph tests are clearly more difficult 
than the proof-reading or recognition tests. In grade VI, how- 
ever, the mean scores on two of the paragraph tests reach and 
exceed the mean score on the difficult proof-reading test. This 
tendency toward a closer approximation of the mean scores on 
the recall and recognition tests becomes still more evident in 
the seventh, eighth, and ninth grades. Of the six paragraph 
tests only the easy proof-reading form maintains the same rela- 
tive difficulty at all grade levels; this form is consistently the 
least difficult of the six paragraph forms. The list test which 
initially occupies a position intermediate between the paragraph 
and recognition tests becomes relatively less difficult at the upper 
grade levels. In the seventh, eighth, and ninth grades, the list 
test is exceeded in difficulty by all forms but the easy proof- 
reading test. 

It would appear from the nature of the developmental curves 
that this change in relative position results from a difference 
in the abilities measured by the recognition and recall tests. 
Although words are evidently more difficult to spell when pre- 
sented in paragraphs than when presented in list or simple 
sentence form, the increase from one grade to the next is similar 
in all the recall forms of spelling tests. The developmental 
curves for the list and sentence tests, as shown in Figure 1, and 
for the dictated paragraphs and the list test, shown in Figure 2, 
tend to be parallel to each other. Each curve maintains its 
own position relative to the others with the exception of the 
one reversal between the average and difficult paragraphs in 
the ninth grade. Similarly the three forms of the recognition 
spelling test maintain their relative positions with respect to 
each other but they change markedly in relation to the recall 
tests. It is evident that in the fourth grade it is much easier 
to recognize misspelled words than to actually spell the words 
correctly. However, this superiority of the recognition tests 
becomes less and less marked until finally the tendency is 
reversed for the easy and average proof-reading situations. 










Although the ability to spell as measured by recall tests is 
evidently affected by context, the same ability is evidently 
being measured by all five situations. This does not hold true 
of the recognition test here employed, since the increase in 
ability to detect errors proceeds at a much slower rate than the 
increase in ability to spell words correctly. 

That the two types of test measure somewhat different abilities 
is likewise shown by the correlations between scores made 
by children on the list tests with their scores on the other seven 
tests at each grade level. These correlations are presented in 
Table III and illustrated graphically in Figure 3. It will be 
noted that the list test correlates very highly with the sentence 
and dictated paragraph tests. The correlations which range 
from .83 to .97 make it evident that performance on the list 
test is highly predictive of relative performance on any of the 
recall tests. An examination of the correlations between the 
list and proof-reading tests, however, shows an entirely different 
picture. The coefficients are noticeably lower than they 
are for the corresponding dictated paragraphs. This is especially 
true in the fourth, fifth, and sixth grades. At these levels the 
proof-reading tests measure a type of spelling ability which is 
much less related to performance on the list test than is spelling 
ability as measured by the recall tests. 

Fig. 3. Correlations between the list test and each of the recall and recognition tests. 
[Figure: correlation, from 0.00 to 1.00, plotted against grade level for the list test versus each of the other seven tests.] 

TABLE III. CORRELATIONS BETWEEN THE LIST TEST AND THE RECALL AND 
RECOGNITION TESTS AT EACH GRADE LEVEL 

                                   Grade  Grade  Grade  Grade  Grade  Grade 
                                   IV     V      VI     VII    VIII   IX 
List vs. Sentence                  .95    .93    .87    .95    .91    .91 
List vs. Easy Paragraph            .88    .84    .92    .92    .93    .90 
List vs. Average Paragraph         .92    .86    .93    .96    .92    .91 
List vs. Difficult Paragraph       .94    .88    .92    .98    .93    .89 
List vs. Easy Proof-Reading        .39    .44    .60    .73    .75    .55 
List vs. Average Proof-Reading     .45    .46    .70    .75    .73    .71 
List vs. Difficult Proof-Reading   .52    .61    .70    .81    .74    .72 

The data pertaining to the proof-reading paragraphs also are 
significant in that the correlation coefficients increase markedly 
in magnitude up to the seventh or eighth grades and then decrease 
to some extent. This decrease is particularly marked for the 
easy proof-reading test. The reduction in the magnitude of the 
correlation coefficients at the ninth grade is quite probably a 
result, in part, of the reduced variability of scores at that level 
(see Table I). The trend for the correlations to increase with 
grade level, however, cannot be accounted for in terms of differ- 
ences in variability. It appears that spelling ability as meas- 
ured by the recognition tests shows some relationship to ability 
as measured by the list test, but this relationship does not 
become really significant until the pupils have attained a fairly 
good mastery of spelling skill. 
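
The coefficients discussed here are product-moment correlations between pupils' scores on two tests. A minimal sketch follows, with hypothetical scores. It also shows, for these invented figures only, how restricting the group to a narrower range of ability can lower the coefficient, which is the effect suggested above for the ninth grade.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five pupils, ordered from weakest to strongest.
list_scores = [10, 20, 30, 40, 50]
paragraph_scores = [8, 19, 27, 41, 47]

r_full = pearson_r(list_scores, paragraph_scores)
# Same pupils with the two weakest removed, i.e. a narrower range of ability.
r_top = pearson_r(list_scores[2:], paragraph_scores[2:])
print(round(r_full, 3), round(r_top, 3))
```

The drop in the restricted group is a property of this small example and should not be read as an estimate of the size of the effect in the study.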

In order to isolate some of the more specific factors affecting 
spelling difficulty, an analysis was made of the percentage of 
pupils spelling each of the sixty words correctly on the eight 
tests. The results for three of these words which are presented 
in Tables IV, V, and VI illustrate a number of specific trends 
that remain hidden when only the mean scores on the completed 
tests are considered. 
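
The word-by-word analysis amounts to computing, for each word and each test form, the per cent of pupils who responded correctly. A sketch under assumed data follows; the record layout and the responses are hypothetical.

```python
def percent_correct(responses):
    """responses: list of True/False, one entry per pupil. Returns a whole per cent."""
    return round(100 * sum(responses) / len(responses))

# Hypothetical records for one grade: word -> test form -> one response per pupil.
records = {
    "laid": {
        "list": [True, False, False, True],
        "easy proof-reading": [False, False, True, False],
    },
}

table = {
    word: {form: percent_correct(responses) for form, responses in forms.items()}
    for word, forms in records.items()
}
print(table)
```

Tabulating by word rather than by pupil is what allows reversals for particular words to appear even when the mean scores on the completed tests move together.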

Of particular interest are the data pertaining to the words in 
List VIII. These words which are the most difficult in the 
entire series show a characteristic trend which is consistent for 
all pupils in the lower grades. Although an extremely small 
percentage of pupils in these grades spell the words correctly 
on the recall tests, a fairly large proportion of the same pupils 
are able to detect the misspellings of the words in the printed 
paragraphs. This is especially true of the easy proof-reading 
forms. 

TABLE IV. PERCENTAGE OF PUPILS IN EACH GRADE SPELLING ‘accommodate’ 
CORRECTLY ON EACH OF THE EIGHT TESTS (Word List VIII) 

Grade                     Easy    Average  Difficult  Easy     Average  Difficult 
Level   List   Sentence   Para.   Para.    Para.      Proof-   Proof-   Proof- 
                                                      Reading  Reading  Reading 
IV       0       0          0       0        0         38       25       23 
V        0       1          0       0        0         46       22       28 
VI       8       3          4       1        2         42       16       20 
VII      5       7          3       2        4         44       11       17 
VIII    68      59         39      47       51         64       50       56 
IX      56      52         40      44       50         64       46       58 

TABLE V. PERCENTAGE OF PUPILS IN EACH GRADE SPELLING ‘laid’ 
CORRECTLY ON EACH OF THE EIGHT TESTS (Word List IV) 

Grade                     Easy    Average  Difficult  Easy     Average  Difficult 
Level   List   Sentence   Para.   Para.    Para.      Proof-   Proof-   Proof- 
                                                      Reading  Reading  Reading 
IV      32      39         38      35       36         10       14       14 
V       34      38         48      47       50         12        8       10 
VI      49      59         63      71       71         20       16       16 
VII     52      60         67      68       69         20       23       18 
VIII    79      85         89      86       91         26       30       22 
IX      87      91         90      86       91         42       49       45 

TABLE VI. PERCENTAGE OF PUPILS IN EACH GRADE SPELLING ‘break’ 
CORRECTLY ON EACH OF THE EIGHT TESTS (Word List IV) 

Grade                     Easy    Average  Difficult  Easy     Average  Difficult 
Level   List   Sentence   Para.   Para.    Para.      Proof-   Proof-   Proof- 
                                                      Reading  Reading  Reading 
IV      66      72         65      67       73         48       44       38 
V       54      69         69      69       73         50       39       37 
VI      71      81         77      81       78         58       50       43 
VII     69      88         80      78       82         61       52       44 
VIII    78      90         87      90       89         63       60       48 
IX      89      91         92      96       92         76       78       75 

The data for the word ‘accommodate’ in Table IV provide a 
striking illustration of this trend. Although no children in 
the fourth grade were able to spell this word correctly on the 
recall tests, 38 per cent of the same pupils detected the incorrect 
spelling on the easy proof-reading form, 25 per cent on the ‘average’ 
form, and 23 per cent on the ‘difficult’ form. In grade V, 
the difference is even more marked. At this level, almost half 
of the pupils (46 per cent) were able to detect the misspelling 
of the word on the easy proof-reading form, yet no pupils were 
successful in spelling the word correctly on the dictated para- 
graph and list tests. One per cent of the same group spelled 
the word correctly on the sentence test. In grades VI and VII, 
the differences are almost as striking as those found for grades IV 
and V. However, in grades VIII and IX, the differences are 
markedly reduced, and in a few instances ‘accommodate’ shows 
a smaller percentage of correct responses on the proof-reading 
forms than on some of the recall tests. 
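
The narrowing of the gap for ‘accommodate’ can be made explicit by subtracting the per cent correct on the list test from the per cent correct on the easy proof-reading form in each grade. The figures below are those read from Table IV; the variable names are illustrative.

```python
# Per cent correct for 'accommodate' by grade, as read from Table IV.
list_pct = {"IV": 0, "V": 0, "VI": 8, "VII": 5, "VIII": 68, "IX": 56}
easy_proof_pct = {"IV": 38, "V": 46, "VI": 42, "VII": 44, "VIII": 64, "IX": 64}

# Positive gap: recognition easier than recall; negative gap: the reverse.
gap = {grade: easy_proof_pct[grade] - list_pct[grade] for grade in list_pct}
print(gap)

reversed_grades = [grade for grade, value in gap.items() if value < 0]
print(reversed_grades)
```

The gap stays above thirty points through grade VII and then falls sharply, with the sign reversed in grade VIII.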

In general this is the pattern found for the other nine words 
in List VIII. In grades IV to VI, the words on the proof-read- 
ing forms are always noticeably easier than on the recall forms. 
In the upper three grades the differences become reduced, and 
in grades VIII and IX, many of the words show greater difficulty 
on the proof-reading forms than on the recall tests. 

A further study of the other lists indicates that as the words 
become less difficult for pupils in a given grade, the differences 
in the percentages of correct responses on the recall and recog- 
nition forms tend to be reduced. As a rule, the words on the 
proof-reading tests tend to be easier than on the corresponding 
dictated paragraph forms. However, many exceptions to this 
trend are found. 

The results pertaining to the word ‘laid’ (Table V), for 
example, show a significantly smaller percentage of correct 
responses on the proof-reading forms than on the recall tests. 
In grade IX, there is a decrease in the percentage of correct 
responses from ninety on the easy dictated paragraph to forty- 
two on the easy proof-reading form. In grade VIII, the decrease 
in percentage value from eighty-nine to twenty-six is even more 
striking. Decreases from twenty to fifty per cent are found 
for the remaining grades. A similar trend is noticeable for 
the word ‘break’ in Table VI. In this case, the discrepancy 
in difficulty on the recall and recognition tests is not as great. 
However, ‘break’ shows a more consistent pattern in relation 
to the contextual setting than does ‘laid.’ Except in one instance 
the percentage of correct responses decreases as the proof-read- 
ing context becomes more difficult. 

It appears that where exceptions to the rule do occur, factors 
very specific to a given word may be affecting the results. It 
is very likely that the particular manner in which a word is 
misspelled will, in part, determine the percentage of correct 
responses on the proof-reading forms, for there is no question 
but that some forms of misspelling are more quickly recognized 
than others. It should be emphasized, however, that excep- 
tions to the general rule occur for those words which are not too 
difficult. Words which are very difficult invariably show a 
much higher percentage of correct responses on the recognition 
tests than on the recall tests. 

The evidence supports the conclusion that a child can detect 
misspellings among words he cannot actually spell by virtue of 
the context in which the word appears or because of the appearance 
of the word. There are many words, however, which a 
child can spell but whose misspelling he will overlook in a 
proof-reading situation. In general, when most of the words are very difficult a 
pupil’s score will be higher on the proof-reading tests than on the 
dictated forms, but when the spelling of most of the words is 
known, his score on the proof-reading test will be reduced in 
proportion to the distraction provided by the situation in which 
his responses are determined. 


CONCLUSIONS 


1) Although words presented in sentences are slightly more 
difficult to spell than words in list form, the mean scores and 
the dispersion approximate each other so closely at all grade 
levels that for practical purposes the tests may be considered 
identical. 

2) Words presented in the form of dictated paragraphs are 
more difficult to spell than words in sentences or lists. The 
mean scores on the dictated paragraphs tend to be directly 
related to the contextual difficulty of the paragraph, running 
slightly higher on the more difficult contexts. 

3) At the earlier grade levels, the proof-reading tests are 
considerably easier than the recall types of spelling test, but 
the difference becomes less marked with each successive grade 
level. In the eighth grade, the average and difficult proof- 
reading paragraphs are more difficult than the recall forms of 
the test. There is a direct relationship between the difficulty 
of the test and the difficulty of the paragraph context in the 
proof-reading situations. The trend which is consistent for 
each grade indicates that the ability to detect misspelled words 
is definitely affected by the difficulty of the context in which 
the words are found. 

4) On the basis of the developmental trends and intercor- 
relations, it may be concluded that the recall and recognition 
tests here employed measure somewhat different abilities. From 
a practical standpoint this means that one type of test cannot 
be substituted for the other as a measure of spelling ability. 
Although the mean scores on the dictated paragraph tests are 
slightly lower than on the list and sentence tests, all of these 
tests are evidently measuring the same ability. 

5) The analysis of individual words indicates that a child
can detect misspellings among words he cannot actually spell
by virtue of the context in which the word appears. When
most of the words are very difficult, a pupil’s score will be higher
on the proof-reading tests than on the dictated forms, but when
the spelling of most of the words is known, his score on the
proof-reading test will be reduced in proportion to the distraction
provided by the situation in which his responses are determined.


REFERENCES 


(1) Ashbaugh, E. J., The Iowa Spelling Scales. J. of Educ. Res. 
Monographs, No. 3, 1922. 

(2) Buckingham, B. R., Buckingham Extension of the Ayres Spelling 
Scale. Bloomington, Illinois, Public School Publishing Com- 
pany, 1918. 

(3) Gates, A. I., A List of Spelling Difficulties in 3876 Words. New 
York: Bureau of Publications, Teachers College, Columbia Uni- 
versity, 1937, pp. 166. 

(4) Moore, J. E., “A comparison of four types of spelling tests for
diagnostic purposes,” J. Exper. Educ., 6: 24-8, 1937.

(5) Northby, A. S., “A comparison of five types of spelling tests for
diagnostic purposes,” J. Educ. Res., 29: 339-46, 1936.























THE CONCEPTS OF THE PROFILE, 
PSYCHOGRAPH, AND EVALOGRAPH 


MORSE P. MANSON 


University of Southern California 


No one will deny that a common and precise terminology is a 
sine qua non of any science. It might also be stated that a lack 
of common and precise definitions is symptomatic of scientific 
immaturity. Such immaturity is apparent today in many of the 
applications of psychological techniques. 

Profiles and psychographs have been in use for at least thirty 
years—a sufficient period of time for crystallization of meaning— 
yet an examination of the literature reveals a multiplicity of 
names and definitions, describing and defining fundamentally 
similar devices and concepts. Considerable overlapping and
uncertainty exist. There appears to be a need for clarification and
control of the basic concepts. A number of writers have defined
the profile and the psychograph. The concept of the ‘evalo- 
graph’ is developed here for the first time, although the principles 
involved have been in use for many years. 

Quite early, H. L. Hollingworth and others recognized the 
necessity for the development of sound techniques, uniformity 
and standardization, and further research: 

“The full utility of the psychographic method will be accomplished
only when techniques and norms are available . . .”¹

“Efforts should be made to arrive at a uniform system of
measurement so that psychographs or psychological diagrams
may be comparable . . .”²

“No standard graph has been agreed upon . . . It is desirable
to achieve uniformity as soon as possible in order that the
psychographic study of individuals may be facilitated . . .”³

1 Hollingworth, H. L., Judging Human Character, 1922, pp. 211-212.
2 Claparede, E., Problems and Methods of Vocational Guidance,
International Labour Office, October, 1922, p. 77.
3 Hollingworth, L. S., Special Talents and Defects, 1923, pp. 38-42.

More recently, the various graphic devices called profiles,
psychographs, and evalographs have been used in an increasing
variety of ways, becoming useful tools to psychologists, psychiatrists,
educators, psychometricians, military investigators, and
personnel technicians in government, industry, and business. So
useful have profiles, psychographs, and evalographs become, that
with each initial application of one of these methods, new names
and definitions blossom forth. (The writer pleads guilty to this
practice.) For example, one hundred and two different terms
used to describe graphic devices possessing the characteristics of
profiles, psychographs, and evalographs are listed in Table I.
This is by no means a complete or final list. The dates given
here do not necessarily indicate priority of use, but only dates of
publication as noted by the writer.


TABLE I

 No. Name                                      Writer (Source)                           Year
  1. Ability Pattern                           D. G. Paterson                            1936
  2. Ability Picture                           M. R. Trabue                              1938
  3. Ability Profile                           M. R. Trabue                              1938
  4. Analysis                                  M. S. Viteles                             1922
  5. Biotypological Profile                    M. S. Viteles (Laugier)                   1937
  6. Character Profile                         R. S. Uhrbrock                            1921
  7. Chart                                     H. D. Kitson                              1917
  8. Comparative Graph Sheet                   S. A. Courtis                             1914
  9. Composite Profile                         L. M. Terman                              1925
 10. Condensed Graph                           B. Parker (Rossolimo)                     1916
 11. Condensed Psychograph                     H. C. Stevens (Rossolimo)                 1917
 12. Counseling Record                         D. G. Paterson-C. d’A. Gerken-M. E. Hahn  1941
 13. Curve                                     S. A. Courtis                             1914
 14. Developmental Profile                     C. Buehler-H. Hetzer                      1935
 15. Diagram                                   W. V. Bingham                             1923
 16. Educational Profile                       T. L. Kelley-G. M. Ruch-L. M. Terman      1923
 17. Educational Profile Chart                 E. B. Greene (Kelley-Ruch-Terman)         1941
 18. Evalograph                                M. P. Manson                              1942
 19. Examination                               H. L. Hollingworth                        1922
 20. Gemelli’s Profile                         A. F. Dodge (Gemelli)                     1935
 21. General Profile                           E. G. Eriksen                             1934
 22. Graph                                     S. A. Courtis                             1914
 23. Graph Sheet                               S. A. Courtis                             1914
 24. Graphic Device                            H. L. Hollingworth                        1922
 25. Graphic Profile                           B. J. Dvorak                              1935
 26. Graphic Profile Chart                     F. M. Earle                               1931
 27. Graphic Rating                            E. M. Goldstine                           1932
 28. Graphic Record                            C. Burt                                   1914
 29. Individual Graph                          C. H. Town                                1920
 30. Individual Profile                        G. S. [illegible]                         1912
 31. Individual Psychograph                    H. L. Hollingworth                        1916
 32. Individual Psychological Profile          C. H. Town                                1920
 33. Individual Record                         S. A. Courtis                             1914
 34. Interest Profile                          [illegible]                               1931
 35. Interest Psychograph                      [illegible]                               1920
 36. Invoice                                   [illegible]                               1922
 37. Job Psychograph                           [illegible]                               1922
 38. Master Profile                            [illegible]                               1942
 39. Medical Psychograph                       [?] Cobb-R. M. Yerkes                     1921
 40. Mental Profile                            C. H. Town                                1920
 41. Minnesota Counseling Profile              D. G. Paterson-C. d’A. Gerken-M. E. Hahn  1941
 42. Numerical Profile                         L. M. Terman                              1925
 43. Occupational Ability Pattern              M. R. Trabue                              1933
 44. Occupational Ability Profile              A. F. Dodge                               1935
 45. Occupational Aptitude Profile             D. Fryer                                  1931
 46. Occupational Profile                      M. S. Viteles                             1937
 47. Occupational Psychograph                  E. Claparede                              1922
 48. P-index of psychograph                    B. Parker (Rossolimo)                     1916
 49. Pattern                                   J. E. Downey                              1920
 50. Pattern of Abilities and Knowledges       D. G. Ryan                                1941
 51. Personality Graph                         F. H. Allport-G. W. Allport               1921
 52. Personality Picture                       H. L. Hollingworth                        1922
 53. Personality Profile                       G. W. Allport                             1922
 54. Personnel Evaluation Summary              M. P. Manson                              1942
 55. Pictorial Representation                  W. Stern                                  1914
 56. Professeogram                             M. S. Viteles (Gusev)                     1933
 57. Profile                                   G. S. Rossolimo                           1912
 58. Profile Chart                             L. M. Terman                              1925
 59. Profile Graph                             H. R. DeSilva                             1938
 60. Profile Line                              W. Stern                                  1914
 61. Profile of Abilities                      L. M. Terman                              1925
 62. Profile of the Individual                 H. L. Hollingworth                        1929
 63. Provisional Profile                       M. S. Viteles                             1937
 64. Psychic Profile                           C. H. Town                                1920
 65. Psychogram                                A. A. Roback                              1927
 66. Psychograph                               H. L. Hollingworth                        1916
 67. Psychograph for Individual Children       C. Burt                                   1917
 68. Psychographic Chart                       A. A. Roback                              1927
 69. Psychographic Profile                     H. L. Hollingworth                        1929
 70. Psychographic Schedule                    F. M. Earle                               1931
 71. Psychological Diagram                     E. Claparede                              1922
 72. Psychological Graph                       E. Claparede                              1922
 73. Psychological Profile                     W. V. Bingham                             1920
 74. Psycho-physical Profile                   A. F. Dodge (Gemelli)                     1935
 75. Psychovocational Graph                    E. Claparede                              1922
 76. Psychovocational Inventory                E. Claparede                              1922
 77. Quantitative Picture                      H. L. Hollingworth                        1929
 78. Record                                    C. E. Seashore                            1912
 79. Record Card                               [?] W. Kallom                             1919
 80. Record of Relations                       [?] L. Rogers                             1919
 81. Report Card                               D. G. Ryan                                1941
 82. Rossolimo’s Graph                         C. H. Town                                1920
 83. Rossolimo’s Profile                       C. H. Town                                1920
 84. Route Schedule for Normative Examination  A. Gesell-H. Thompson                     1938
 85. Scholastic Psychograph                    C. Burt                                   1937
 86. Selectograph                              [illegible] (Bausch and Lomb Opt. Co.)    1942
 87. Summary Profile                           M. R. Trabue                              1933
 88. Summary Record                            E. G. Williamson                          1934
 89. Summary Record of Individual Diagnosis    E. G. Eriksen                             1934
 90. Talent Chart                              C. E. Seashore                            1919
 91. Teacher Examination Profile               D. G. Ryan                                1941
 92. Temperamental Pattern                     J. E. Downey                              1925
 93. Test Profile                              E. G. Williamson                          1934
 94. Trait Profile                             H. C. Warren                              1934
 95. Visual Performance Profile                J. Tiffin (Bausch and Lomb Opt. Co.)      1942
 96. Visual Profile                            J. Tiffin                                 1942
 97. Viteles’ Job Psychograph                  A. F. Dodge (Viteles)                     1935
 98. Vocational Profile                        [?] H. Martin                             1935
 99. Vocational Psychograph                    H. L. Hollingworth                        1916
100. Volitional Pattern                        J. E. Downey                              1920
101. Will Profile                              J. E. Downey                              1920
102. Worker Characteristics Form               R. S. Ward                                1940


THE CONCEPT OF THE PROFILE 


There seem to be three tendencies, at least, which obfuscate 
a precise usage of the term ‘profile.’ The first is the common 
practice of using profile and psychograph synonymously. The 
second is to use the term ‘profile’ as a wastebasket and dump into 
it all manner of graphic and literary descriptions. The third is a 
delimitation prescribing its use to narrow interpretations. 

How has this concept been treated by earlier writers? 

“. . . the graph resulting from . . . a charting of measures
. . . of an individual in some array of tests or traits.”¹

“. . . not, strictly speaking, a method of obtaining a total
score, but presents graphically an individual’s scores in several
tests in relation to each other. A curve is plotted showing his
relative standing in the various tests.”²

“. . . equals a schematic outline of the characteristic mental
traits of an individual in so far as they can be determined
quantitatively and presented in graphic form . . . (syn: psychograph)”³

“. . . synonymous with ‘psychogram,’ a description of the
mental life of an individual, a mental representation . . .
synonymous with ‘psychograph,’ a psychological biography or
analysis of an individual . . . a ‘graph’ or ‘curve’ . . . a group
of data representing quantitatively the extent to which an
individual exhibits certain traits or abilities as determined by tests or
ratings and usually represented in the form of a graph.”⁴

1 Hollingworth, H. L., Judging Human Character, 1922, p. 201.
2 Bingham, W. V. and Freyd, M., Procedures in Employment Psychology,
1926, pp. 211-214.
3 Warren, H. C., Dictionary of Psychology, 1934.
4 Webster’s New International Dictionary.

“Individual psychograph or profile . . . The method usually
involves a graphic representation of each score made on a battery
of tests. The scores on different tests are expressed in mental
age units, percentile values, or sigma units.”¹

“A profile has been variously defined as a line diagram which
indicates the relative position of an individual or group in each
of several traits, thus bringing into relief divergent standings on
various traits, as well as the general tenor of the scores or ratings;
as a curve uniting the successive points on a graph representing
an individual’s status in each of several traits; and as a method of
recording the status of an individual on each of several traits so
that the geometric pattern so created may be meaningful.”²

In this treatment of the concept, the profile is not a psycho- 
graph. The concepts of the profile and the psychograph can be 
clearly differentiated and used independently. Nor is the profile 
a description of the mental life of an individual. It may represent
data which measure or describe ‘phases’ of the mental life of
an individual. It is for this same reason that the profile should 
not be called a ‘psychological profile,’ since this term is too all- 
inclusive, also. The term ‘profile’ should be limited to the data 
it represents. 

There is no predictive power inherent in the profile. Its 
interpretation, as in the interpretation of a temperature chart, 
depends on the skill of the interpreter. The use of the profile is 
not limited to individual diagnosis. Its use can be applied to 
groups, as well as to multi-factor situations. Nor is the profile 
limited to mental testing, personality testing, aptitude testing, 
or any single group of tests. A profile can be constructed from 
data of many sources, as, for example, measures of physical 
strength, clerical ability, typing speed, and background in his- 
tory, if such a combination of factors were considered significant 
in a certain situation. 

The profile need not be constructed with fixed statistical units, 
such as percentile values, sigma units, or mental level scores. 
Any units, objective or subjective, can be used if they are practical
and convertible into comparable scores. The profile, in itself,
is not a measure of relationships.





1 Dvorak, B. J., Differential Occupational Ability Patterns, Employment
Stabilization Research Institute, University of Minnesota Press, V. 11,
No. 8, p. 309.
2 Edgerton, E. A., Bordin, E., Molish, H., “Some Statistical Aspects of
Profile Records,” J. Ed. Psych., V. 32, 1941, p. 185.


What is the profile? The profile is constructed in graphic 
form, and it may be a curve or a line. This curve or line represents
many measurements and/or estimates, either of individuals
or of groups of individuals or of complexes of factors. The 
scores translated into the curve should be expressed in identical 
terms. 

It can be stated that a profile is a graphic curve or line 
constructed from many individual or group measurements 
and/or estimates expressed in identical terms. 
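
As a minimal sketch of this definition, assuming sigma units (standard
scores) are chosen as the identical terms, the following Python fragment
converts raw scores on unlike measures into comparable values and returns
the profile as an ordered series of points. The function names, the
norm-group figures, and the individual's raw scores are hypothetical
illustrations and are not data from any study cited here.

```python
from statistics import mean, pstdev


def to_sigma_units(score, norm_scores):
    """Express one raw score in sigma units relative to a norm group."""
    spread = pstdev(norm_scores)
    if spread == 0:
        raise ValueError("norm group has no spread; sigma units are undefined")
    return (score - mean(norm_scores)) / spread


def build_profile(raw_scores, norms):
    """Return the profile: ordered (measure, sigma-unit score) points."""
    return [
        (name, round(to_sigma_units(raw_scores[name], norms[name]), 2))
        for name in raw_scores
    ]


# Hypothetical norm groups for three unlike measures (assumed values).
norms = {
    "physical strength": [40, 50, 60],
    "clerical ability": [10, 20, 30],
    "typing speed": [30, 45, 60],
}
# Hypothetical raw scores of one individual on the same measures.
raw = {"physical strength": 60, "clerical ability": 20, "typing speed": 30}

profile = build_profile(raw, norms)
for measure, sigma in profile:
    print(f"{measure:20s} {sigma:+.2f}")
```

Percentile values or mental level scores could be substituted for sigma
units in the same way; the only requirement taken from the definition is
that every point on the curve be expressed in the same terms.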

In brief, the profile is the curve, nothing more and nothing 
less. This curve is a graphic representation of many factors. 
With supplementary information, this curve will allow the various 
relationships which may be present to be easily grasped and 
understood. Complexity is the result of numerous intermingled 
factors or variables. Once these variables can be isolated, 
measured, controlled, and presented together in simple graphic 
form, considerable insight often may be achieved. Com- 
plexities exist in situations as well as in individuals or groups. 
Thus the profile may be used in attempting an understanding of 
situations outside, as well as within, the individual. 


THE CONCEPT OF THE PSYCHOGRAPH 


The psychograph is considered by most writers to be identical 
with the profile. This is confusing. The psychograph, as 
H. L. Hollingworth has recognized, is more than the profile. It 
is an expression of a method of analysis, more comprehensive and 
penetrating than the profile. 

Some of the earlier concepts were: 

“The psychographs are constructed by plotting the score of
each test at the appropriate point on the ordinate, the abscissa
of which is the function tested.”¹

“Psychography may be defined as the science of making
graphic records of mental traits.”²

1 Stevens, H. C., “A Revision of the Rossolimo Tests,” Studies in Psychology,
Titchener Commemorative Volume, 1917, p. 130.
2 Uhrbrock, R. S., “Vocational Psychographs,” Education, V. x1, No. 8,
April, 1921, p. 510.

“Each specific mental ability which is required by the job
is checked with an X. The X is placed in one of the columns
marked 1-2-3-4-5 with reference to the degree to which it is
significant on the job . . . A line connecting the X’s on the
chart gives what might be called a ‘job psychograph.’”¹

“In more recent developments the term ‘psychograph’ has
been used especially to designate a particular analytic and graphic
method of exhibiting the measures of an individual in some
array of tests or traits.”²

“The job psychographic method is based upon a subjective
analysis of the mental requirements of a job. This has been
supplemented by the more objective techniques of describing
job qualifications in terms of aptitude tests used in predicting
proficiency on the job.”³

“Psychograph: [1] a chart used by personality investigators
to indicate an individual’s measure in the fundamental
personality traits, these traits being placed at equal distances, either
in a row along the abscissa, or in a column along the ordinate
axis, the values obtained for each being marked at the
appropriate point in the other axis, the chart is completed by
connecting these points by lines so as to form a psychic or trait profile.
[2] a descriptive account of an individual’s mental functions
(i.e., attention, memory, perception, etc.) treated differentially
and functionally. (W. Stern). [3] (loosely) a record in literary
form of an individual’s traits and responsive behavior, as revealed
by a series of laboratory experiments and tests. (Toulouse,
Binet). Syn: psychic or mental profile, psychogram (rarely
used).”⁴

“Psychography . . . sometimes synonymous with literary
biography, sometimes with any random listing of disparate
psychological facts concerning a given person. Also may
designate an orderly case study or life history from the clinical point of
view . . . as used here (By G. W. Allport) means simply a
printed graph or profile upon which is plotted the actual
magnitude of common traits attained by any individual.”⁵





1 M. S. Viteles, “Job Specifications and Diagnostic Tests of Job
Competency Designed for the Auditing Division of a Street Railway Company,”
Psych. Clinic, V. 14, 1922-1923, pp. 103-105.

2 Hollingworth, H. L., Judging Human Character, 1922, p. 201.

3 Otis, J. L., and Smuts, K. R., “The Job Psychograph in Job Analysis,”
Occupations, June, 1934, p. 54.

4 Warren, H. C., Dictionary of Psychology, 1934.

5 Allport, G. W., Personality, A Psychological Interpretation, 1937, p.
402.



Transposing, it can be said, the psychograph is not a profile; 
and, like the profile, the psychograph—as developed in this con- 
cept—is not a description of the mental life of an individual or a 
psychological biography. Nor is the psychograph, as an instru- 
ment, more predictive than its interpreter. And it is clear that 
the psychograph need not be restricted to individual analysis, 
nor to any limited phase of testing, nor to any fixed statistical 
units of measurement. 

The psychograph, however, should include a profile, as
defined earlier, among other things. Also included should be a
condensed analytical record containing numerical and descriptive
data significantly related to the subject under consideration.
All this should be presented in a convenient form.

A psychograph is a condensed analytical record of a complex 
of factors, containing a profile and related numerical and descrip- 
tive data. In other words, the psychograph is the recorded 
expression of many measurements, concisely presented in graphic, 
numerical, and descriptive form. 


THE CONCEPT OF THE EVALOGRAPH 


The name ‘evalograph’ was brought forth when it was realized 
that adequate interpretations of many scores or ratings of the 
same individual required considerable thought and appraisal. 
It called for the weighing of performances, and, most important 
of all, the evaluation and prediction of the behavior of the total 
individual in an actual situation. This process of evaluation—so 
vague and yet so insistent—required expression in the record, 
hence the term ‘evalograph.’ 

The evalograph provides for the use of substantially more 
information than does the psychograph. Personal data such as 
work history, educational record, appearance, condition of 
health, special qualifications, etc., as well as test results may be 
recorded on the evalograph. This approach, of course, is not 
novel. 

“. . . measures must be set in a full survey by systematic
observations and other verified information bearing upon the
valuation of the individual as a singer.”¹

1 Seashore, C. E., “The Measure of a Singer,” Science, Feb. 9, 1912, p.
202.

“. . . any mental, physical or environmental factors which
may be relevant.”¹

“The general plan is, therefore, to interpret the curve in the
light of all the knowledge of the individual that can be obtained,
and to check conclusions reached by further tests at the first
opportunity.”²

The job psychographs of Viteles as used by Wood and Cades
contained supplementary information, such as the amount of training
required, opportunities for promotion, advantages and
disadvantages, allied jobs, and remarks. The many studies of
Trabue, Paterson, et al., in the Employment Stabilization
Research Institute program made use of considerable information
obtained from many sources.

Should a great number of evalographs be used, as may be the 
case in some organizations, the use of slotted or holed cards is 
suggested. Quick sorting and classifying can then be accom- 
plished. Coding systems may provide for the recording of con- 
siderable information. The Selectograph of the Bausch and 
Lomb Optical Company, Rochester, N. Y., is such an evalograph. 

The interpretation of the data on the evalograph calls for 
sound clinical techniques. Recommendations, which are 
recorded on the evalograph, should be made after careful
consideration of all factors. If necessary, recommendations may
require the pooled estimates of more than one judge. Trabue,
Paterson, et al. used panels of experts to help diagnose their
cases of unemployed men and women. The importance of 
health, attitudes, interests, and appearance must be properly 
evaluated and recorded, if these factors are included in the 
evalograph. Recommendations made may call for extensive 
programs of training, re-training, or rehabilitation, and such 
recommendations together with essential comments may be 
recorded on the evalograph. Rossolimo suggested the use of 
similar practices in the interpretation and application of his 
psychographs. 

1 Seashore, C. E., The Psychology of Musical Talent, 1919, pp. 15-29.
2 Courtis, S. A., The Courtis Standard Tests, 1914, pp. 45-49.

“The profile of a single individual may be studied for its
varying mental qualities and perhaps some vocational guidance
might be worked out by determining the best traits of the person
and their possible combination in some type of employment
. . . The developmental levels of the same individual might be
studied through a series of years . . .”¹

The results of interviews, the interests and beliefs of the sub- 
ject, the position taken by the parents, the estimates of teachers, 
the remarks of employers and supervisors, if of value, can and 
should be recorded. Earle used just such information in his 
vocational guidance studies.² Skilled individual diagnoses and
recommended therapeutic measures, whether of physical, men- 
tal, emotional, or vocational nature, are to be recorded on the 
evalograph. 

Another characteristic of the evalograph is the use of ‘follow- 
up’ or control techniques. The number and variety of these is 
unlimited. Courtis suggested the use of re-tests. Rossolimo 
wanted to follow his cases through over a period of years. Earle 
used employment records to check on the reliability and validity 
of his methods. Manson used records of present and future 
work performance, supervisor’s ratings, estimates of over-all 
adjustment to the work situation, and follow-up interviews to 
check on work interest and satisfaction.³ The Bausch and
Lomb Optical Company use a punched-card system which has 
possibilities of manipulation in many control directions. In 
addition, they provide for an ‘eye history.’ 

A list of control devices which have been used on evalographs 
would include the following: 


1. Further testing
   a. New tests
   b. Re-tests
2. Employment records
3. Work performance
4. Ratings
5. Adjustment to work
6. Earnings
7. Punched card system
8. Follow-up interviews


The use and recording of considerably more information, the 
application of sound clinical techniques, and the development of 
control devices characterize the evalograph. The evalograph, 
then, is a psychograph with additional interpretive and control 
data, requiring clinical techniques to use most effectively. 
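
The nesting of the three concepts can be sketched as a set of record
types. The following Python fragment is only an illustration under stated
assumptions: the class and field names are hypothetical, chosen to mirror
the wording of the definitions, and the sample entries are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    # The curve only: (measure, score) points expressed in identical terms.
    points: list


@dataclass
class Psychograph:
    # A condensed analytical record: a profile plus related data.
    profile: Profile
    numerical_data: dict = field(default_factory=dict)
    descriptive_data: dict = field(default_factory=dict)


@dataclass
class Evalograph(Psychograph):
    # A psychograph with additional interpretive and control data.
    personal_data: dict = field(default_factory=dict)
    recommendations: list = field(default_factory=list)
    control_data: dict = field(default_factory=dict)


# Invented sample record; none of these entries come from a real case.
record = Evalograph(
    profile=Profile(points=[("typing speed", -1.22), ("clerical ability", 0.0)]),
    numerical_data={"typing speed (raw)": 30},
    descriptive_data={"examiner remarks": "worked steadily"},
    personal_data={"work history": "two years as file clerk"},
    recommendations=["re-training in typing"],
    control_data={"re-test": None, "follow-up interview": None},
)
```

Making the evalograph a subtype of the psychograph follows the statement
that the evalograph is a psychograph with further data; the profile is
held as a component because the psychograph contains a profile rather
than being one.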





1 Parker, B., “The Psychograph of Rossolimo,” Am. J. of Insanity, V. 73,
October, 1916, pp. 273-293.

2 Earle, F. M., Methods of Choosing a Career, 1931.

3 Manson, M. P., Narrative Report of Personnel Evaluation Section, Federal
Works Agency, Work Projects Administration, Southern California, October
1942, pp. 20-22.



CONCLUSIONS 


1) The increasing use of profiles, psychographs, and evalo- 
graphs has resulted in an expanding and over-lapping termi- 
nology. The clarification of the basic concepts seemed necessary. 

2) Such a clarification was attempted in this paper, with the 
following definitions as the result: 


(a) A ‘profile’ is a graphic curve or line constructed from 
many individual or group measurements and/or estimates 
expressed in identical terms. 

(b) A ‘psychograph’ is a condensed analytical record of 
a complex of factors, containing a profile and related numeri- 
cal and descriptive data. 

(c) An ‘evalograph’ is a psychograph with additional 
interpretive and control data, requiring clinical techniques 
to use most effectively. 


3) The concept of the evalograph is developed here for the 
first time. 

















CONFUSION IN EDUCATIONAL MEASUREMENT 
W. A. SAUCIER 


Baker University 


Since the modern objective test was devised about twenty- 
five years ago, its promoters have expressed opinions freely about 
its function as well as its structure. In doing this, they have 
often presented views that are inadequate, contradictory, and 
untenable. As is true of political propaganda, these ideas have 
been repeated so frequently that they have been commonly 
accepted by many unsuspecting students of education. Accord- 
ingly, a careful and constructive reexamination of some of these 
unsupported points of view is an urgent necessity. 

Confusion in educational measurement has resulted largely 
from the almost universal failure of leaders to express explicitly 
their psychology and philosophy of education in their discussion 
of the examination. One writer, for example, goes so far as to 
state that measurement is not interested in the schools of
psychology or their varying interpretations.¹ How can he and
others evaluate the types of examinations without continuous
reference to the purpose and process of education? How can
the objectives and means of education be determined apart from
psychology and philosophy of education? Outside the field of
educational measurement it is commonly recognized that it is
impossible to evaluate any instrument, or tool, without considering
the specific purpose of its construction. This principle has
been violated, however, as some of the theorists have endeavored
to determine the advantages and disadvantages of types of
examinations.

1 E. W. Tiegs, Tests and Measurements in the Improvement of Learning,
Boston: Houghton Mifflin Company, 1939, p. 350.

Explicit and continuous application of educational psychology
and philosophy to measurement would result in definite and
adequate references to what is measured as well as to the
technique of measurement. A glance at typical written discussions
of the examination reveals a monotonous recurrence of the phrase
‘measuring achievement.’ It is unfortunate that some leading
writers use this expression recklessly, failing to commit
themselves on the kind of achievement they may have in mind.
There can be achievement in either democratic or undemocratic
living, in either relating ideas or acquiring them separately, in
either intelligent discussion or the reproduction of isolated facts, 
and in either thinking or making automatic responses. Theorists, 
in thus repeatedly referring to ‘measuring achievement,’ without 
adding a word to denote the particular kind of achievement they 
aim to measure, are apparently unwilling to express explicitly 
their point of view about it. However, by implication at least, 
they reveal that they consider achievement solely as the acquisition
of bits of subject-matter. This indicates that they belong
to the atomistic school of psychology.¹

Such a narrow view of education and educational measurement 
has been shown further by some of these leaders as they have 
restricted the application of the concept of validity. There is 
consensus that a test is valid if it measures what it is designed to 
measure. Furthermore, all seem to agree that validity consti- 
tutes the most important criterion for judging the value of an 
examination. Yet in applying this criterion, some writers have 
confined its use to determining the degree to which the test 
measures knowledge of the material of the course.² Apparently
they believe that a test should not be expected to reveal achieve- 
ment in interests, attitudes, ideals, habits of study, habits of 
thinking, and creative activity. Otherwise they would insist 
that a test must show the degree of development in these out- 
comes, if it is to be considered as valid. 

1 That the typical writer on measurement discusses the subject in
isolation, without reference to psychology and philosophy or the whole program
of education, suggests further that most of them are atomistic psychologists
in their thinking in this area.

2 See, for example, H. A. Greene, A. N. Jorgensen, and J. R. Gerberich,
Measurement and Evaluation in the Secondary School, New York: Longmans,
Green and Company, 1943, pp. 52-61.

The organismic psychology of learning and the democratic
philosophy of education suggest learning involving insight, broad
comprehension, scientific attitudes and habits, problem solving,
logical organization of ideas, and creative activity. The teacher
who really aims at the promotion of this type of learning directs
his procedure persistently toward these related ends. Moreover,
he does not teach according to one set of objectives and test
according to another set. He insists on an examination that
affords the pupil ample opportunities to show his ability to
compose, to create, to engage in extensive discussion, and to
organize his thoughts logically. In other words, he believes
that only as an examination has as its primary purpose the
determination of development in these major outcomes of a
democratic program of education can it be considered to be
a valid test.

In contrast, the popular atomistic psychology of Thorndike¹
and traditional theory in education support objectives and teach- 
ing procedure that emphasize learning by parts, or the acquisi- 
tion of elements of subject-matter. Such an educational program 
requires an examination that provides for an extensive ‘sampling’ 
of specific items of the course. Since application of facts, recon- 
struction of thoughts, broad comprehension, insight, solving prob- 
lems, extended discussion, and logical self-expression are not of 
primary importance in the objectives and teaching procedures 
of the atomistic psychologist, he can consistently consider them of 
only secondary importance in his program of measurement. 
Thus a strictly factual examination possesses high validity for an 
educational program that consists primarily of the memorization 
of isolated facts. 

The obvious logical conclusion is that validity in measurement, 
as elsewhere, is never general, but always specific. No instru- 
ment, or tool, is useful in the abstract, but always for a particular 
purpose. The new-type examination,² or objective test, has high
validity in traditional education, which is supported by atomistic 
psychology. On the other hand, the essay, or discussion, 
examination has superior validity for measurement in the modern 
program of education, which is based on the organismic theory of 
learning. 

The specificness of validity in educational measurement has 
usually been recognized by leading writers. Ross, for example, 





1 Incidentally, it is significant to note that authorities seem to agree on 
crediting Thorndike and his followers with developing and popularizing 
the modern objective test. See, for example, C. C. Ross, Measurement in 
Today’s Schools, New York: Prentice-Hall, 1941, p. 48. 

2 It is important for the student of educational measurement to recognize
the rather obvious fact that the new-type examination is new only in form.
Since it is essentially an examination on isolated details of subject-matter,
it is in this respect as old as the written examination. A typical example
of the new-type examination is: “America was discovered by a ________.”
This same item in old form, such as was common in the days of our
grandfathers, would read: “Who discovered America?”







says that “validity is always specific,” that general validity does
not exist. Yet he disregards this important principle as he
contradicts himself in making the general statement that a
“serious limitation” of the essay examination is that it has “low
validity.” In making this unlimited assertion he does not refer
at all to the specific purposes, or uses, of this examination. If he 
had done this he would have pointed out that the objective test 
has high validity for measuring the elements of subject-matter, 
but that the essay examination has high validity for revealing 
integrated learning, the reorganization of thoughts, broad com- 
prehension, and creative expression. These two contradictory 
statements by Ross are typical examples of the many conflicting 
assertions made by leading theorists in educational measurement. 

Equally misleading has been the ordinary discussion of the 
difference in reliability between the objective test and the essay 
examination. Most of the writers have made considerable effort 
to furnish proof that the essay examination is so unreliable that 
it is of little or no value in educational measurement. The 
studies on the unreliability of the essay examination have revealed 
nothing essentially new or surprising. The subjectivity of 
grading answers involving extensive discussion in contrast to 
the objectivity of grading those involving only one word was, 
of course, recognized by many public-school teachers before 
there were studies of the unreliability of the discussion, or essay 
examination. Any teacher could see this. The ‘scientific’ 
studies merely demonstrated further what had already been 
evident to the observing teacher. Moreover, an objective 
interpretation of these studies reveals that they do not support 
the conclusions usually drawn by writers using them in an effort 
to prove the unreliability of the essay examination. Greene, 
Jorgensen, and Gerberich are among the few writers who hold 
that “educators for various reasons somewhat misinterpreted the
findings of these and many subsequent studies of the traditional
examination.”²

To comprehend the misrepresentation of the findings of these 
studies, one might well view them as they are presented in table 
form by Tiegs along with his interpretation of them.³ Tiegs, as


1 Ibid., p. 74.
2 Greene, Jorgensen, and Gerberich, op. cit., p. 133.
3 Tiegs, op. cit., pp. 8-11.













a typical theorist in measurement, points out that grades of each 
essay examination in these studies vary by forty to seventy 
points. Observation of the lists of scores in the several studies, 
however, reveals that only a few scores show such a wide vari- 
ation. Instead, the large majority of them in each test tend to 
cluster. For some reason, Tiegs and others refrain from pointing 
out this significant fact. Further, it should be noted that some, 
if not all, of these studies involve unrealistic situations. For 
instance, several teachers graded only one paper; whereas in 
ordinary situations one teacher grades several papers and thus 
is likely to increase the reliability of the papers as he compares 
them one with another. In like manner, these writers strangely 
neglect to point out that the graders of the examinations in 
most of the studies did not construct the examinations they 
graded. Since, however, the classroom teacher constructs his 
own essay-examination questions based on what he has been 
teaching, he is thereby aided in approaching reliability in the 
grading of the papers. 

The sound conclusion is that although the subjectivity of the 
discussion examination makes it somewhat unreliable, studies 
do not show that it is as unreliable as certain writers assert that 
it is. Indeed, some studies tend to show positively the possibility
of a rather high degree of reliability in this examination.¹
The conclusion from one such study is that the reading of essay
examinations by capable readers who have set up standards of
working is sufficiently reliable “to call into question the
often-repeated statement that essay examinations cannot be scored
reliably.”²

The point of importance is that in practical situations teachers 
have always been faced with the problem of choosing between, 
on the one hand, the high validity but questionable reliability of 
the discussion examination for measuring the outcomes of free 
and creative self-expression, logical organization of ideas, and 
critical thinking; and, on the other hand, the high validity and 
obvious reliability of the objective test for measuring the acquisi- 
tion of bits of subject-matter or isolated facts. Leaders in 





1 See, for example, F. E. Bolton, “Do Teachers’ Marks Vary As Much As
Supposed?” Education, (September, 1927), xiv, pp. 23-39.

2 A. E. Traxler and H. A. Anderson, “The Reliability of an Essay Test in
English,” School Review, (September, 1935), xliii, p. 538.










measurement tend to agree that reliability is secondary to 
validity. Hence, the issue before the classroom teacher should 
largely be the selection of the examination that is best for the 
measurement of the ends of education that he prizes most highly. 

Some writers in measurement have claimed that in recent 
years objective tests for measuring elements of thinking have 
been designed. Careful observation by an unbiased investigator 
can be expected to reveal that the effectiveness of such tests has 
not been demonstrated. Segel, who believes that these tests 
function in the measurement of reasoning, submitted as an illus- 
tration the following question in general science: 


Spaces are left between the rail lengths of railroad tracks, (a) to allow 
for differences in load, (b) to allow for differences in air pressure, (c) to 
allow for differences in temperature, (d) so that they can be fastened 
together easily, (e) to stop electricity from passing along the rails.¹


It should be obvious that there is no essential difference 
between this multiple choice question and those commonly con- 
structed and usually accepted as being factual. The correct 
answer is an isolated fact that can be memorized as such from 
the text. Besides, the incorrect answers are so absurd that the 
right answer stands alone as the only possible one. To say 
the least, the pupil is not required to express his own thoughts; 
he needs only to check another person’s thoughts. 

Likewise, Wrightstone reported his attempt to construct 
objective tests of reasoning, or thinking, in the social studies.²
These tests consist of a series of paragraphs each of which is 
followed by about six short multiple-choice items that are sup- 
posed to test the ability of the pupil to recognize correct inter- 
pretations, conclusions, or generalizations, of the paragraph. 
In an effort to show the value of such tests for discovering these 
mental functions, he correlated the scores of one of them with 
those of one in which pupils were required to state their own 
interpretations of the same series of paragraphs. Finding the 
significantly high correlation of .86, he concluded that the latter 
of the two tests could be substituted for the former. 





1 David Segel, “Some Newer Practices in Evaluation,” School Life,
(June, 1941) xxvi, pp. 269-270.

2 J. W. Wrightstone, “Measuring Some Major Objectives of the Social
Studies,” School Review, (December, 1935), xliii, pp. 771-779.










One can accept this conclusion without recommending the 
use of either one of the two tests for the determination of ability 
to draw conclusions, or to make generalizations. The high 
correlation of these tests may show that they are equally poor
or equally good. The fact is that in each of them the pupil is 
furnished a few facts in a series of brief paragraphs which he is 
required to interpret. In neither case is it necessary for him to 
possess a mass of facts from which he would need to select the 
ones necessary to answer a thought-provoking question such as 
is found in a good essay examination. 

Similarly, in an attempt to illustrate test items that measure 
results other than factual learning, Tiegs gives a true-false item 
that he claims can be used to reveal interest and appreciation. 
It is: “A symphony concert is more interesting than a movie.”
As an example of an item that he believes to be valuable for the
determination of an ideal or an attitude, he submits the true-false
statement: “It is better to lose a game than to win by
playing ‘dirty.’”¹ It is strange that Tiegs does not realize
that, since a pupil is almost certain to see such specific affirma- 
tions in a textbook or to hear his teacher make them, he is very 
likely to mark each of them as true, whatever his reaction to 
either statement may be. Under such circumstances, he would 
indicate that he prefers a symphony to a movie even though 
he has never even attended a symphony; and that he is for 
playing a ‘clean’ game, whatever his true attitude may be. 

There is little justification, therefore, for some recent claims 
that educators now have objective tests that can be substituted 
for the general discussion examination, in the effort to measure 
such higher mental functions as are involved in solving complex 
problems, logical organization of thoughts, creative self-expres- 
sion, presentation of a comprehensive point of view, and scien- 
tific thinking. If one could succeed in measuring each element 
of any one of the preceding complex activities, one would not 
thus discover the nature of the activity as a whole. In measure- 
ment, as elsewhere, one must reckon with the established fact 
that the whole is more than the sum of its parts. 

There are writers who admit that the discussion examination 
has as its special function the determination of progress in the 





1 Tiegs, op. cit., pp. 55-56.










expression of relationships, the reorganization of thoughts, and 
the application of knowledge; but they discount the value of 
this examination because it involves composition. What they 
apparently fail to see is that although composition consists of
more than the mechanics of written expression (capitalization,
punctuation, the use of parts of speech, and the like), these
are essential in the reconstruction and expression of thoughts.
Thus composition may be a non-essential or even a hindering 
factor in an examination that measures elements of learning 
objectively, but it is a necessity in an examination that measures 
effectively integration, or thinking, in learning. The sound 
conclusion is that, contrary to the beliefs of some writers in 
measurement, unless an examination involves composition, it is 
more valuable for the measurement of isolated facts than for 
the more important outcomes of relating and applying the facts. 

The same theorists show further confusion in their thinking
as they refer to the extent to which the two kinds of examinations
involve ‘sampling’ of learning. One reason that they favor the
objective test is that it provides a more extensive and thorough
sampling of learning or achievement.¹ Here, as elsewhere, they
do not point out the nature of the achievement that they have 
in mind. Is it achievement solely in the acquisition of facts; 
or these facts along with the development of creativeness, 
scientific attitude, democratic habits and ideals, and the like? 

Some of them give as an analogy the practice of sampling
wheat from several points in a bin, to obtain accurate information 
about the kind of wheat in it. What these writers seemingly 
fail to grasp is that, according to modern psychology, both the 
pupil and effective learning are more than an aggregation of 
elements such as constitute a bin of wheat. The teacher does 
not discover integration in learning or thinking through sampling 
isolated elements of learning. Since the objective test does not 
involve the reorganization of thoughts, critical self-expression, 
free discussion, and the like, its sampling of learning, or achieve- 
ment, is clearly restricted to the acquisition of the elements of 
subject-matter. We conclude that, in the light of all the objec- 
tives of education, especially those stressed in a democratic 
program, the objective test does not provide for as broad a sam- 
pling of learning, or achievement, as the discussion examination. 





1 See, for example: Greene, Jorgensen, and Gerberich, op. cit., pp. 155-157.
















In this connection it should be pointed out that the most 
valuable discussion-examination question involves a compre- 
hensive problem, requiring the pupil to draw on a mass of infor- 
mation and to select only those facts that are pertinent to the 
solution of the problem. The all-too-common ‘discussion’ ques- 
tion is illustrated in the two following ones: 


(1) Who was Patrick Henry? 
(2) Name four northern generals in the Civil War. 


A question requiring broad knowledge of two periods of history 
as well as ability to think is illustrated by the following question: 


Point out how some problems of war faced by Franklin D. Roosevelt 
are similar to those faced by Woodrow Wilson. 


If the teacher gives several examinations involving such broad
questions, he has no reason to fear that he will fail to secure a
‘sampling’ of the pupils’ knowledge, not to mention here their ability to
think.

Again, some writers attempt to discount the discussion, or 
essay, examination by pointing out that this examination as 
ordinarily administered does not afford time to pupils to organize 
and express their thoughts. This, of course, is an observable 
fact. It should be unnecessary, however, to point out that a 
customary misuse of an examination constitutes no argument 
for disuse of it. Accordingly, these writers should direct atten- 
tion, not toward attacking the discussion examination at this 
point, but toward urging teachers to make discussion examina- 
tions short enough and give them often enough for pupils to 
practice proper habits of self-expression. 

These commentators have sometimes added that the time for 
the development of pupils in discussion is outside the examination 
period. It is, indeed, important that pupils receive many and 
varied experiences in self-expression, or composition, between 
examination periods. In fact, only after pupils have engaged 
often in extended written expression, during directed study and 
in oral expression during the recitation, can they be expected 
to be prepared to express themselves reasonably well in the 
examination. It is clearly inconsistent, however, to hold that 
it is important for pupils to participate freely and extensively in 










organizing and expressing their thoughts prior to the examina- 
tion, but not during this period. To do so involves advocating 
a type of learning distinctly different from that on which the 
pupil is examined. 

This leads to consideration of the very important function 
of any kind of examination—that of stimulating and directing 
learning. As can be easily observed and as experimentation 
has shown, the pupil tends to form the study habits required in 
preparation for the type of examination given.¹ The pupil,
as well as the teacher, makes the examination practically the 
end of education. If the examination calls for the memorization 
of unrelated, specific facts, the pupils consequently concentrate 
on this kind of learning; if, on the other hand, it requires relating 
facts, the logical organization of thoughts, the solution of prob- 
lems, and critical thinking, the pupils are likely to direct at 
least some effort toward this type of learning. This suggests 
very forcibly the importance of selecting an examination that is 
especially helpful in the promotion of the most important values 
in education. 

Thus it can be seen that the suitable, valid examination in a 
democratic program of education provides the pupil with oppor- 
tunities to take an active part in full and free expression of his 
own ideas based on wide and critical reading. This examination 
does not demand ‘passive’ responses of him in the form of his 
merely checking correct answers already expressed. On the 
contrary, it requires him to draw on his own fund of information, 
to select those facts that bear on the problem at hand, to recon- 
struct his thoughts on the basis of these facts, and to express the 
result in logical form. 

Examples of questions that could be used in this type of 
examination are as follows: 





1 See as proof of this statement: George Meyer, “The Essay Type of
Examination,” American School Board Journal, (December, 1934), lxxxix,
pp. 17-18; E. C. Class, “The Effect of the Kind of Test Announcement on
Students’ Preparation,” Journal of Educational Research, (January, 1935),
xxviii, pp. 358-361; H. R. Douglass and Margaret Talmadge, “How
University Students Study for New Types of Examinations,” School and
Society, (March 10, 1934), xxxix, pp. 318-320; Paul Terry, “How Students
Review for Objective and Essay Tests,” Elementary School Journal, (April,
1933), xxxiii, pp. 592-603.










I) Present arguments for or against free trade among countries.

II) Compare conditions facing Abraham Lincoln during the Civil
War with those facing George Washington during the Revolutionary
War.

III) Present and support reasons why the United States has progressed
further than the South American countries educationally.

IV) Explain and illustrate the meaning of socialism.

V) Criticize the positions of Woodrow Wilson and of Congress on
the provisions for peace that they proposed at the end of World War I.

VI) Point out modern trends in the relation of government to business,
describing what are causing these trends.

VII) From your extensive reading about democracy in the United
States, what conclusions would you draw concerning what is required to
improve its functioning in this country?

VIII) Give your complete story of the significant events that occurred
between the United States and Japan leading to the attack on Pearl
Harbor.


It should be reemphasized that prior to using such examination 
questions the teacher should spend period after period directing 
pupils in free and full expression, especially in written form. In 
history, for example, he could well suggest to the pupils that 
they write the story of the discovery of America or of the Con- 
stitutional Convention. As they are in the process of writing 
he would walk about the room questioning, assisting, and encour- 
aging each pupil. Likewise, in the assignment period and the 
recitation period, the teacher would concentrate on directing 
the pupils’ real discussion of the genuine problems before them. 
After they begin to show a reasonable degree of proficiency in 
thus expressing themselves under the direction of the teacher, 
they are ready to write independently. At this point the wise 
teacher does not announce an examination in an effort to scare 
them into cramming for it, but simply suggests that it is time 
for them to write without help. It is thus that the discussion 
examination can be made to become an integral part of the whole 
classroom procedure. 

After about twenty-five years of such confusion among the 
leading theorists in measurement, some teachers in public schools 
and colleges use the objective test and some the discussion exam- 
ination. Teachers who practice promoting traditional, or 
atomistic, learning are logical in making extensive use of the 
objective test. On the other hand, those who actually endeavor 







to follow a modern, or democratic, program of education, are 
consistent in using the genuine discussion examination. Unfor- 
tunately, the teacher may use the objective test merely because 
its form is something new or it is relatively easy to administer; 
or the essay examination, because he has formed the habit of 
using it. The point for the student of educational measurement 
to grasp is that the teacher should choose an examination in the 
light of his whole educational program and should make it an 
essential and integral part of this program. It is thus that the 
classroom teacher may avoid inconsistent practices comparable 
to the common inconsistencies among some of the leading 
theorists in educational measurement. 








THE INTERPRETATION OF 
FREQUENCY RATINGS OBTAINED FROM 
“THE TEACHERS WORD BOOK” 


FREDERICK B. DAVIS 


Coöperative Test Service of the American Council on Education


For many years, the frequency ratings in The Teachers Word 
Book¹ have been used by teachers, authors of textbooks, and
test constructors as indicators of the difficulty of words. It has 
commonly been assumed that, by using only words having 
frequency ratings at certain selected levels, the difficulty of the 
material being prepared could be controlled. 

When Form Q of the Coöperative English Test was prepared in
1939, some interesting data pertaining to the relationship 
between The Teachers Word Book frequency ratings and actual 
difficulty of test items were obtained. The recent article by 
R. G. Simpson in the Journal of Educational Psychology concern- 
ing the vocabulary sections of this test suggests that the publica- 
tion of these data may serve to clear up some widespread 
misconceptions about the use of The Teachers Word Book 
frequency ratings.²

The excellent description in The Teachers Word Book of the 
method used to obtain ratings for the words listed therein makes 
it clear that these ratings indicate the relative ‘frequency’ with 
which the words occur in a large sampling of English writing. 
A few moments’ consideration will indicate why the ‘frequency’ 
ratings in The Teachers Word Book should not be used indis- 
criminately as ‘difficulty’ ratings. First, with few exceptions, 
no separation of the relative frequency of the several meanings 
that a given word may possess is made in The Teachers Word 





1 E. L. Thorndike, The Teachers Word Book of Twenty Thousand Words.
New York: Bureau of Publications, Teachers College, Columbia University,
1932 (revised).

2 R. G. Simpson, “The Vocabulary Sections of the Coöperative English
Tests at the Higher Levels of Difficulty,” J. Educ. Psychol., xxxiv (March,
1943) pp. 142-151.

Simpson refers to “the upper levels of difficulty of the Thorndike Word
Manual.” (p. 144.) This substitution of ‘difficulty’ for ‘frequency’ is
made by many people and results from a misunderstanding of the nature of
the numerical ratings in The Teachers Word Book.








Book.³ The word ‘bumper’ illustrates this point. As a noun,
‘bumper’ may mean part of an automobile or a glass full to the 
brim; as an adjective, it describes an unusually good crop. 
Second, the per cent of the population that knows the meaning 
of a word changes with the passage of time. The word ‘bumper’ 
is a good example of this, too. Since most of the materials 
included in the word count underlying The Teachers Word Book 
frequency ratings were written, the word ‘bumper,’ meaning 
part of an automobile, has been used more and more frequently 
and its meaning has become known to almost every American 
beyond the nursery-school age. Hence, it is reasonable to sup- 
pose that the word ‘bumper,’ meaning part of an automobile, is 
now used far more often than one might be led to infer from an 
inspection of its frequency rating in The Teachers Word Book. 

In the course of constructing Form Q of The Coöperative
English Test, three different types of vocabulary item were 
prepared for experimental tryout: 

Type 1: Five-choice multiple-response items of the conven- 
tional sort. These ranged from very easy to extremely hard 
items. Following is a typical item of this sort: 


01. segregate 
01-A combine 
01-B disconnect 
01-C set apart 
01-D dissect 
01-E imitate 


Type 2: Completion items in which a definition of a word is 
presented together with the number of letters in the word 
defined. The pupil is required to write the word that he thinks 
is defined. These items are not, of course, machine-scorable. 
Following is a typical item of this sort: 


02. The price demanded in return for setting a captive free is a 
— (six-letter word) —. 


Type 3: Completion items in which a definition of a word is 
presented together with the number of letters in the word 


3 For this reason, work was initiated on the Semantic Word Count.
Through the courtesy of Dr. Irving D. Lorge, a copy of this work was made
available for use in the construction of the Coöperative Reading
Comprehension Tests.













defined. The initial letter of the word is also stated so that this 
kind of item is machine-scorable. Following is a typical item 
of this sort: 


03. Excessive pride in oneself and in one’s own ability is called — 
(seven-letter word) — 
03-A C 
03-B L 
03-C M 
03-D T 
03-E V 


These three types of vocabulary item were administered to 
secondary-school pupils to obtain data for item-analysis pro- 
cedures. For each item, both a difficulty index and a correlation 
coefficient between success or failure on the item and total score 
on the particular test in which the item was included were com- 
puted. The difficulty indices provide reasonably accurate infor- 
mation regarding the relative difficulties of the items among 
secondary-school and college students.⁴ The difficulty indices
may also be taken as serviceable approximations to the relative 
difficulties of the words whose meanings are tested. The 
difficulty of a multiple-choice vocabulary item for a given group 
of subjects is dependent on two main factors: first, the per cent 
of the group that could define the word correctly if asked to state 
its meaning; and, second, the degree of discrimination required 
to distinguish between the correct answer and the incorrect 
answers, or decoys, in the item. The importance of this second 
point has often been overlooked with unfortunate results. Test 
constructors have built items to test for knowledge of words like 
‘syzygy’ or ‘umbel.’ Such words have virtually no practical 
value except to specialists in certain learned professions; hence, 
they reduce the real validity of general vocabulary tests, but 


4 The Coöperative Reading Comprehension Tests: Information Concerning
Their Construction, Interpretation, and Use. New York: Coöperative Test
Service, 1940. p. 3 (col. 1). These difficulty indices include a correction
to take account of the fact that some testees do not reach every item in a
timed test and that many items are omitted by testees who have read them.
The fact that items were selected for the final forms of the Coöperative
Reading Comprehension Test with due regard for a proper distribution of item
difficulty indices explains Simpson’s discovery that “the words . . . conform
quite well to a symmetrical distribution in Table III.”







they have been included to provide very difficult items in vocabu- 
lary tests that are not made up of items in which the decoys have 
been chosen with care and ingenuity so that they differ only 
slightly, though incontestably, from the correct meanings of the 
words being tested. 
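
The two item statistics named above can be illustrated with a minimal sketch. It assumes the simplest forms: the proportion of pupils passing an item, and the product-moment correlation between success or failure on the item (scored 1 or 0) and total score. The article does not say which coefficient was actually computed, and its difficulty index is on a different scale (higher means harder, with corrections applied), so the function names and the six-pupil data below are hypothetical.

```python
import math


def proportion_passing(item_scores):
    """Share of pupils answering the item correctly (1 = right, 0 = wrong)."""
    return sum(item_scores) / len(item_scores)


def item_total_correlation(item_scores, total_scores):
    """Product-moment correlation between 0/1 item success and total score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    sxx = sum((x - mean_x) ** 2 for x in item_scores)
    syy = sum((y - mean_y) ** 2 for y in total_scores)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for six pupils (not taken from the article).
item = [1, 1, 0, 1, 0, 0]
total = [50, 45, 30, 40, 20, 25]

print(proportion_passing(item))
print(round(item_total_correlation(item, total), 2))
```

A high item-total correlation indicates only that pupils who pass the item tend to score well on the rest of the test; it says nothing about how hard the item is.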


TABLE I. PRODUCT-MOMENT CORRELATIONS OF OBTAINED DIFFICULTY INDICES⁵ AND
“THE TEACHERS WORD BOOK” FREQUENCY RATINGS⁶ OF CERTAIN WORDS

                                          Standard
                                          error of
                                Corre-    zero-               Mean      σ of      Mean      σ of
                                lation    order     Number    diffi-    diffi-    fre-      fre-
                                coeffi-   corre-    of        culty     culty     quency    quency
Type of item                    cient     lation    items     index⁷    indices   rating    ratings

Type 1: Five-choice
  multiple-response
  vocabulary items              .10       .07       208       61.86     29.58     12.48     3.26

Type 2: Definition of word
  stated with number of
  letters in word defined.
  Initial letter of defined
  word not given                .19       .14        50       88.34     12.59     15.88     3.00

Type 3: Definition of rather
  easy word stated with
  number of letters in word
  defined. Initial letter of
  defined word not given        .05       .12        73       51.55     29.28      6.38     1.56

  Definition of rather hard
  word stated with number
  of letters in word
  defined. Initial letter of
  defined word not given        .14       .13        59       63.00     24.15      8.01     3.88


5 The higher the difficulty index, the harder the item. 

6 The higher the frequency rating, the less common the word.

7 The difficulty indices for items of Type 1 and Type 3 were corrected for 
chance, since the testee has one chance in five of getting a five-choice item 
correct by random marking. 
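
The correction for chance mentioned in this note can be sketched as follows. The article does not give its formula, so the sketch assumes the conventional one, in which the number right is reduced by the number wrong divided by one less than the number of choices. The function name is hypothetical, and the sketch shows only the chance correction itself, not how the article's difficulty index (on which a higher value means a harder item) was derived from it.

```python
def corrected_rights(rights, wrongs, choices=5):
    """Number right, reduced by the number expected from random marking.

    Assumes the conventional correction R - W / (k - 1): on a k-choice
    item, each lucky guess is expected to accompany k - 1 wrong guesses.
    """
    return rights - wrongs / (choices - 1)


# Of 100 testees marking a five-choice item, 60 right and 40 wrong:
print(corrected_rights(60, 40))

# Pure random marking (one chance in five) is corrected to zero:
print(corrected_rights(20, 80))
```

Omitted items count as neither right nor wrong under this convention, which is why the correction depends on the number wrong and not on the number of testees.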










Because the difficulty of a multiple-choice vocabulary item is, 
in part, dependent on the choices provided in it, the difficulty 
indices of the vocabulary items of Type 1, as described above, 
must be regarded merely as reasonably serviceable estimates 
of the true difficulties of the words tested in the sample of sec- 
ondary-school pupils examined. Items of the second and third 
types require the testee to supply the correct answer. Since the 
correct answers to these items are not presented for recognition, 
they tend to be more difficult than items of the first type. It is 
harder to supply a word to fit a definition than it is to define a 
given word. An element of chance in happening to hit
the keyed response seems to be involved, too, so that the difficulty
of an item of either of these types is likely to vary considerably
from one individual to another even in a highly literate
group.

With the limitations of the data in mind, we may now examine 
the correlations between The Teachers Word Book frequency 
ratings of certain words and the obtained difficulty indices of 
items constructed to test for their meanings. 

The data in Table I indicate that there is only a slight tendency 
for test items to be more difficult as the words being tested become 
less frequent according to the word count on which the frequency 
ratings in The Teachers Word Book are based. None of the four 
correlation coefficients reported in Table I is significantly differ- 
ent from zero. It is reasonable, therefore, to conclude that the 
frequency ratings in The Teachers Word Book should not be 
interpreted as indications of the difficulty of words. There is 
nothing in The Teachers Word Book to justify such an interpre- 
tation; subjective considerations discourage it; and the data in 
Table I are not in agreement with it. 
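
The statement that none of the four coefficients differs significantly from zero can be checked with a few lines of arithmetic. The sketch below is not part of the original analysis. It assumes that the standard error of a zero-order correlation was taken as 1/sqrt(N - 1), a form that reproduces the tabled standard errors to two decimals, and it copies the coefficients and item counts from Table I. The names `TABLE_I`, `se_zero_r`, and `critical_ratios` are illustrative only.

```python
import math

# (row label, correlation coefficient, number of items), copied from Table I.
TABLE_I = [
    ("Type 1", 0.10, 208),
    ("Type 2", 0.19, 50),
    ("Type 3, rather easy words", 0.05, 73),
    ("Rather hard words", 0.14, 59),
]


def se_zero_r(n_items: int) -> float:
    """Standard error of r when the true correlation is zero.

    Assumption: the form 1 / sqrt(N - 1), which matches the tabled values.
    """
    return 1.0 / math.sqrt(n_items - 1)


def critical_ratios(rows):
    """Return (label, r, standard error, r / standard error) for each row."""
    return [(label, r, se_zero_r(n), r / se_zero_r(n)) for label, r, n in rows]


for label, r, se, ratio in critical_ratios(TABLE_I):
    print(f"{label:28s} r = {r:.2f}  SE = {se:.2f}  r/SE = {ratio:.2f}")
```

Under this assumption every coefficient is less than twice its standard error, which agrees with the conclusion drawn in the text.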

This does not mean that test constructors may not want to 
select the words to be used in vocabulary tests from The Teachers 
Word Book; this volume provides a convenient and useful list 
of words for many purposes. It does mean that if test items are 
to be arranged in order of difficulty, a numerical index of difficulty 
must be obtained experimentally, as was actually done for every 
item in the Coöperative Reading Comprehension Tests.⁸ In some


8 Frederick B. Davis, et al., The Coöperative Reading Comprehension Tests,
Lower and Higher Levels, Forms Q, R, S, and T. New York: Coöperative
Test Service, 1940-1943. The Coöperative Reading Comprehension Tests







circumstances, it is undesirable to arrange test items in order of 
difficulty. This is the case, for example, when the items are 
grouped in repeating scales, because testees might then be able 
to defeat the object of using this ingenious technique, which is 
to obtain real ‘power’ scores unaffected by the rate at which 
the testee chooses to work, by deliberately omitting the more 
difficult items toward the end of each repeating scale.⁹

The data in Table I indicate that, while teachers and authors 
of textbooks may find it desirable to restrict the words in mate- 
rials being prepared for school use to certain frequency levels of 
The Teachers Word Book, they should be aware that by so doing 
they are probably accomplishing very little toward controlling 
the actual ‘difficulty’ of the materials. 





constitute a part of the Coöperative English Test, but they are published
separately as eight forty-minute reading tests that yield comparable scores. 
The recognition-vocabulary items in these eight tests are arranged in order 
of difficulty. Those who have prepared copy for printing will understand 
the practical necessity for occasionally shifting the order of items to meet 
the mechanical requirements of paging. 

9 Frederick B. Davis, The Coöperative Vocabulary Tests, Long Forms
Q and R; Short Forms Qs and Rs. New York: Coöperative Test Service,
1941-1942. 

These tests exemplify the use of the repeating-scale technique. The 
items are arranged in repeating scales of thirty items; the items are, there- 
fore, not placed in order of difficulty. In fact, a rather easy word was 
deliberately placed last in each repeating scale in order to increase the 
practical efficiency of the scoring procedure. The items in the paragraph-
reading section of each form of the Coöperative Reading Comprehension
Tests are also grouped in repeating scales of thirty items and are not, there- 
fore, arranged in order of difficulty. 








RELIABILITY OF MULTIPLE-CHOICE TESTS 
AS A FUNCTION OF NUMBER OF CHOICES PER 
ITEM 


FREDERIC M. LORD 


U. S. Civil Service Commission


It is a generally recognized fact that changing the number of 
choices per item in a multiple-choice test tends to change the 
reliability of the test, other things being equal. A formula 
expressing, under certain circumstances, the exact amount of the 
change in reliability is derived in the following pages and sub- 
jected to experimental validation. 

The Spearman-Brown prophecy formula,

$$r_{tt} = \frac{N r_{ij}}{1 + (N - 1) r_{ij}}, \tag{1}$$

provides an exact expression for the correlation ($r_{tt}$) between the
sum of any N variables and the sum of any N other variables if
the correlations ($r_{ij}$) between pairs of variables, and likewise
the standard deviations of the variables, are all equal. If an
individual’s score on a test is the sum of his scores (commonly 0 
or 1) on the N items in the test, the reliability of the test (the 
correlation between the test and a hypothecated alternate form 
of the test, equivalent in that the items in the two tests meet the 
stated assumptions underlying equation 1) is expressed by equa- 
tion 1 in terms of the correlation between the items in the test. 
The necessary assumptions are that the interitem correlations, 
and also the standard deviations of the scores on the items, shall 
be equal. 
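
As an illustration of equation 1, the following minimal sketch computes the reliability of a test from the common interitem correlation and also inverts the relation. The function names and the numerical values (fifty items, an interitem correlation of .05) are hypothetical and are chosen only to show the arithmetic.

```python
def spearman_brown(n_items: int, r_ij: float) -> float:
    """Equation 1: reliability of the sum of n_items items whose
    intercorrelations all equal r_ij."""
    return n_items * r_ij / (1 + (n_items - 1) * r_ij)


def interitem_from_reliability(n_items: int, r_tt: float) -> float:
    """Equation 1 solved for the interitem correlation r_ij."""
    return r_tt / (n_items - (n_items - 1) * r_tt)


# Hypothetical example: fifty items with a common interitem correlation of .05.
r_tt = spearman_brown(50, 0.05)
print(f"test reliability      = {r_tt:.3f}")
print(f"recovered r_ij        = {interitem_from_reliability(50, r_tt):.3f}")
```

The second function is the step used later in passing from formula 9 to formula 10.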

It is desired to find the effect on $r_{tt}$ of changing the number of
possible responses to the test item. Assuming that the item
scores are dichotomous variables, the interitem correlation is the
fourfold point correlation given by the formula

$$r_{ij} = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}, \tag{2}$$

where a, b, c, and d are the proportions of the total frequency in
the four cells of the fourfold correlation table. Assuming that
the percentage of examinees answering an item correctly is the
same for all items (p = a + b = a + c), equation 2 becomes

$$r_{ij} = \frac{a - p^2}{pq}, \tag{3}$$

where q = 1 - p.
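
The reduction of equation 2 to equation 3 can be verified numerically. The sketch below builds a fourfold table from hypothetical values of a and p (so that b = c = p - a and d = 1 - 2p + a, since the four proportions sum to unity) and compares the two expressions. The function names are illustrative.

```python
import math


def phi_from_cells(a: float, b: float, c: float, d: float) -> float:
    """Equation 2: fourfold point correlation from the four cell proportions."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


def phi_equal_difficulty(a: float, p: float) -> float:
    """Equation 3: the same correlation when both items have proportion p correct."""
    q = 1 - p
    return (a - p * p) / (p * q)


def cells_for(a: float, p: float):
    """Fill in b, c, d from a and p, using p = a + b = a + c and a + b + c + d = 1."""
    b = c = p - a
    d = 1 - 2 * p + a
    return a, b, c, d


# Hypothetical values: 55 per cent pass both items, 72.5 per cent pass each item.
a, p = 0.55, 0.725
print(phi_from_cells(*cells_for(a, p)))
print(phi_equal_difficulty(a, p))
```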

In order to determine the effect upon $r_{ij}$ of changing the number
of item choices, let it be assumed that each individual answers 
every item in the test and that each item has n choices all of 
which are equally plausible to, and hence equally often chosen by, 
those individuals who do not know the correct answer to the item. 
Let A, B, C, and D, respectively, be the values that would be 
found instead of a, b, c, and d if all those individuals who did not 
know the correct answer to an item were to get that item wrong. 
Then, by the usual laws of probability, since under the assump- 
tions made the chance of getting the right answer to an item by 
guessing is 1/n, 





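$$\begin{aligned}
d &= \frac{(n - 1)^2}{n^2} D, \\
c = b &= \frac{n - 1}{n} B + \frac{n - 1}{n^2} D, \quad \text{and} \\
a &= A + \frac{B}{n} + \frac{C}{n} + \frac{D}{n^2}.
\end{aligned} \tag{4}$$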


If the number of choices to each item is changed to n’, it is 
found by similar reasoning that 


$$a' = A + \frac{B}{n'} + \frac{C}{n'} + \frac{D}{n'^2}, \tag{5}$$


where the primes denote values pertaining to the revised items. 
Eliminating A, B, C, and D from equations 4 and 5 and remem- 
bering that b = c and that a +b +c+d = 1, it is found that 


$$a' = a - 2kb + k^2 d = (1 + k)^2 a - 2k(1 + k)p + k^2, \tag{6}$$

where $k = \dfrac{n' - n}{n'(n - 1)}$. Likewise it is found that

$$p' = p - kq. \tag{7}$$


If the expressions found in equations 6 and 7 are substituted 
in equation 3, the value of the correlation between the revised 
items is found to be 











$$r'_{ij} = \frac{(1 + k)(a - p^2)}{(1 - p)(p - kq)} = \frac{p + pk}{p - kq} \cdot \frac{a - p^2}{pq} = f r_{ij}, \tag{8}$$

where $f = \dfrac{p + pk}{p - kq}$.
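
The chain of equations 4 through 8 can be checked by direct computation. The following sketch is offered only as an illustration: the "true" proportions A, B, C, and D are hypothetical, and the function names are not the author's. It forms the observed cells for n and for n' choices under the guessing assumption, computes the two fourfold correlations, and compares the second with f times the first.

```python
import math


def observed_cells(A, B, C, D, n):
    """Equation 4: observed cell proportions when every examinee who does not
    know an answer guesses among n equally plausible choices."""
    g = 1.0 / n                      # chance of guessing correctly
    a = A + g * B + g * C + g * g * D
    b = (1 - g) * B + g * (1 - g) * D
    c = (1 - g) * C + g * (1 - g) * D
    d = (1 - g) ** 2 * D
    return a, b, c, d


def phi(a, b, c, d):
    """Equation 2: fourfold point correlation."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


def lord_f(p, n, n_new):
    """The multiplier f of equation 8, with k defined as after equation 6."""
    k = (n_new - n) / (n_new * (n - 1))
    q = 1 - p
    return (p + p * k) / (p - k * q)


# Hypothetical proportions knowing both items, one item only, and neither item.
A, B, C, D = 0.30, 0.15, 0.15, 0.40
n, n_new = 2, 5

cells_old = observed_cells(A, B, C, D, n)
cells_new = observed_cells(A, B, C, D, n_new)
p_old = cells_old[0] + cells_old[1]  # p = a + b

print("r_ij for n choices   :", phi(*cells_old))
print("r_ij for n' choices  :", phi(*cells_new))
print("f times original r_ij:", lord_f(p_old, n, n_new) * phi(*cells_old))
```

The last two printed values agree, as equation 8 requires, for any admissible choice of A, B = C, and D.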
It can now be shown from equations 1 and 8 that 


$$r'_{tt} = \frac{N f r_{ij}}{1 + (N - 1) f r_{ij}}. \tag{9}$$





If equation 1 is solved for $r_{ij}$ in terms of $r_{tt}$, and the resulting
expression substituted for $r_{ij}$ in equation 9, formula 10 is obtained.
This formula expresses the reliability of the revised test in terms 
of the reliability of the original test and the value f, which is a 
constant determined by the percentage of individuals answering 
the items correctly and by the number of choices per item in the 
original and in the revised test. 





$$r'_{tt} = \frac{N f r_{tt}}{N + (N - 1)(f - 1) r_{tt}}. \tag{10}$$


Remmers¹ has suggested that $r'_{tt}$ may be predicted by direct
application of the Spearman-Brown prophecy formula,

$$r'_{tt} = \frac{M r_{tt}}{1 + (M - 1) r_{tt}}, \tag{11}$$

letting M = n'/n. This formula lacks any clear theoretical
justification and is open to the objection that for a fixed value of n 
the predicted reliability approaches 1.00 as n’ becomes large. 
Such a prediction is unreasonable since the reliability of a test 
containing a given number of items has an upper limit less than 
1.00 fixed (see formula 1) by the reliability of the individual 
items, and no item can be made to approach perfect reliability 
simply by the inclusion of additional choices in the item. A 
further objection is that formula 11 does not take into account the 
difficulty of the test. The increase in test reliability resulting 
from an increase in the number of choices per item must be to 
some extent a function of item difficulty, since this increase
derives from a reduction in the number of correct answers 
achieved by guessing, and since the amount of guessing actually 
occurring depends on the difficulty of the test. 
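
The contrast between formulas 10 and 11 can be made concrete with a short computation. In the sketch below, which is an illustration and not part of the original study, the function names are hypothetical, and the starting values (a fifty-item, two-choice test with p = .676 and reliability .737) are those of the first line of the Ruch-Stoddard data tabulated in this article.

```python
def lord_formula_10(r_tt, p, n, n_new, n_items):
    """Formula 10: predicted reliability when the number of choices per item
    is changed from n to n_new, for a test of n_items items."""
    k = (n_new - n) / (n_new * (n - 1))
    q = 1 - p
    f = (p + p * k) / (p - k * q)
    return n_items * f * r_tt / (n_items + (n_items - 1) * (f - 1) * r_tt)


def remmers_formula_11(r_tt, n, n_new):
    """Formula 11: Spearman-Brown formula applied with M = n_new / n."""
    m = n_new / n
    return m * r_tt / (1 + (m - 1) * r_tt)


# Two-choice, fifty-item test: p = .676, reliability = .737.
for n_new in (3, 5, 10, 100, 10_000):
    f10 = lord_formula_10(0.737, 0.676, 2, n_new, 50)
    f11 = remmers_formula_11(0.737, 2, n_new)
    print(f"n' = {n_new:6d}   formula 10: {f10:.3f}   formula 11: {f11:.3f}")
```

As n' grows, the prediction from formula 11 approaches 1.00, whereas that from formula 10 levels off below 1.00, which is the objection raised in the text.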










An empirical check of formula 11 was made by Remmers¹,
utilizing for this purpose certain data obtained by Ruch and
Stoddard² and by Ruch, DeGraff, and Others³; and these same
data have been used to make an empirical check of formula 10. 
The value of p used in applying formula 10 was the average of 
the percentages of correct answers to the items, this average being 
obtained by dividing the mean score on the test by N. The 
results of the empirical check, including those of Remmers, are 
presented in tabular form below. That part of the Ruch-DeGraff 
data relating to tests administered under explicit instructions to 
the examinees not to guess has not been utilized in the present 
study since one of the assumptions basic to formula 10 is that 
every examinee selects an answer to every item. 

In that part of the Ruch-Stoddard study relevant to the present 
purpose, two forms of a fifty-item history and social science test 
were used, each item in each form having been prepared as a five- 
response, a three-response, and a two-response multiple-choice 
item. Three groups supposedly equal in ability, each consisting 
of one hundred thirty-five high-school seniors, were tested, one 
group being given both forms of the five-response test; a second, 
both forms of the three-response test; and the third, both forms 
of the two-response test. The reliability of each of the three tests 
was computed from the scores on the two forms. 

In that part of the study by Ruch, DeGraff, and Others that is 
relevant to the present purpose, two forms of a 100-item United 
States history test were used, each item having been prepared as a 
seven-, five-, three-, and two-response item. Four groups 
approximately equal in ability, each consisting of approximately 
two hundred forty pupils in grades VII, VIII, XI, and XII, 
were tested, each group being given two forms of one of the four 
multiple-choice tests. The reliability of each of the four multi- 
ple-choice tests was computed from the scores of the two forms. 

It is seen that formula 10 yields better predictions than formula 
11 in two out of six cases on the Ruch-Stoddard data and in 
eleven out of twelve cases on the Ruch-DeGraff data, although 
the differences between the predictions obtained from the two 
formulas are in many cases slight. The agreement between pre- 
dicted and actual reliabilities is on the whole not particularly 
good, however. Remmers and Others, using formula 11, have 
obtained good agreement in a more recent set of studies⁴,⁵,⁶,⁷
















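Ruch-Stoddard Data

| Original test: number of choices per item (n) | Original test: average percentage of correct answers per item (p) | Original test: reliability ($r_{tt}$) | Revised test: number of choices per item (n') | Revised test: actual reliability ($r'_{tt}$) | Reliability predicted from formula 10 | Reliability predicted from formula 11 |
|---|---|---|---|---|---|---|
| 2 | .676 | .737 | 3 | .598 | .821 | .808 |
| 2 | .676 | .737 | 5 | .796 | .871 | .875 |
| 3 | .570 | .598 | 5 | .796 | .680 | .713 |
| 2 | (a) | .682 (b) | 3 | .675 (b) | .777 | .763 |
| 2 | (a) | .682 (b) | 5 | .775 (b) | .836 | .843 |
| 3 | (a) | .675 (b) | 5 | .773 (b) | .749 | .776 |

Ruch-DeGraff Data

| Original test: number of choices per item (n) | Original test: average percentage of correct answers per item (p) | Original test: reliability ($r_{tt}$) | Revised test: number of choices per item (n') | Revised test: actual reliability ($r'_{tt}$) | Reliability predicted from formula 10 | Reliability predicted from formula 11 |
|---|---|---|---|---|---|---|
| 2 | .6945 | .745 | 3 | .837 | .823 | .814 |
| 2 | .6945 | .745 | 5 | .864 | .868 | .880 |
| 2 | .6945 | .745 | 7 | .800 | .884 | .911 |
| 3 | .5885 | .837 | 5 | .864 | .880 | .895 |
| 3 | .5885 | .837 | 7 | .800 | .895 | .923 |
| 5 | .4985 | .864 | 7 | .800 | .881 | .899 |
| 2 | (a) | .864 (b) | 3 | .858 (b) | .911 | .905 |
| 2 | (a) | .864 (b) | 5 | .902 (b) | .937 | .941 |
| 2 | (a) | .864 (b) | 7 | .839 | .946 | .957 |
| 3 | (a) | .858 (b) | 5 | .902 (b) | .896 | .910 |
| 3 | (a) | .858 (b) | 7 | .839 | .910 | .934 |
| 5 | (a) | .902 (b) | 7 | .839 | .915 | .928 |

(a) The appropriate values of p from the immediately preceding lines
of the table were used.

(b) These reliabilities were calculated from scores corrected for guessing
by the formula Score = R - W/(n - 1).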







designed and carried out by the authors—a fact which would 
indicate that the Ruch-Stoddard and Ruch-DeGraff data may 
not be suitable for the present purpose. Unfortunately, formula 
10 cannot be checked by means of the data in Remmers’ more 
recent studies (nor by means of a third set of data utilized in the 
earliest of his articles) since the mean scores on the tests adminis- 
tered are no longer available, and hence there is no way of esti- 
mating the value of p to be used in formula 10. 
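
The tally of cases in which formula 10 comes closer to the actual reliability than formula 11 can be recomputed from the tabled values. The sketch below copies the actual and predicted reliabilities from the table as printed. The variable and function names are illustrative, and "better" is taken here to mean a smaller absolute difference from the actual reliability, which is an assumption about how the comparison was made.

```python
# Each row: (actual reliability of revised test, formula 10, formula 11).
RUCH_STODDARD = [
    (0.598, 0.821, 0.808),
    (0.796, 0.871, 0.875),
    (0.796, 0.680, 0.713),
    (0.675, 0.777, 0.763),
    (0.775, 0.836, 0.843),
    (0.773, 0.749, 0.776),
]

RUCH_DEGRAFF = [
    (0.837, 0.823, 0.814),
    (0.864, 0.868, 0.880),
    (0.800, 0.884, 0.911),
    (0.864, 0.880, 0.895),
    (0.800, 0.895, 0.923),
    (0.800, 0.881, 0.899),
    (0.858, 0.911, 0.905),
    (0.902, 0.937, 0.941),
    (0.839, 0.946, 0.957),
    (0.902, 0.896, 0.910),
    (0.839, 0.910, 0.934),
    (0.839, 0.915, 0.928),
]


def count_closer(rows):
    """Number of rows in which formula 10 lies nearer the actual value than formula 11."""
    return sum(1 for actual, f10, f11 in rows if abs(f10 - actual) < abs(f11 - actual))


for name, rows in (("Ruch-Stoddard", RUCH_STODDARD), ("Ruch-DeGraff", RUCH_DEGRAFF)):
    print(f"{name}: formula 10 closer in {count_closer(rows)} of {len(rows)} cases")
```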

In conclusion, it would appear that the theoretical derivation 
of formula 10, as given on the preceding pages, together with 
the experimental results presented in the table, provide a basis for 
confidence in the utility of this formula. Further experimental 
validation of the formula is much to be desired. 


BIBLIOGRAPHY 


1) Remmers, H. H., et al.: “Reliability of Multiple-choice Measuring
Instruments as a Function of the Spearman-Brown Prophecy Formula,
I” Journal of Educational Psychology, Vol. XXXI, No. 8, November, 1940,
pp. 583-590.

2) Ruch, G. M. and Stoddard, G. D.: “Comparative Reliabilities
of Five Types of Objective Examinations” Journal of Educational
Psychology, Vol. XVI, 1925, pp. 89-103.

3) Ruch, G. M., et al.: Objective Examination Methods in the Social
Studies, Chicago: Scott, Foresman, and Co., 1926, pp. 54-89.

4) Remmers, H. H. and Denney, H. R.: “Reliability of Multiple-
choice Tests as a Function of the Spearman-Brown Prophecy Formula,
II” Journal of Educational Psychology, December, 1940, pp. 699-704.

5) Remmers, H. H. and Ewart, Edwin: “Reliability of Multiple-
choice Measuring Instruments as a Function of the Spearman-Brown
Prophecy Formula, III” Journal of Educational Psychology, January,
1941, pp. 61-66.

6) Remmers, H. H. and House, J. Milton: “Reliability of Multiple-
choice Measuring Instruments as a Function of the Spearman-Brown
Prophecy Formula, IV” Journal of Educational Psychology, May, 1941,
pp. 372-376.

7) Remmers, H. H. and Adkins, R. M.: “Reliability of Multiple-
choice Measuring Instruments, a Function of the Spearman-Brown
Prophecy Formula, VI” Journal of Educational Psychology, May, 1942,
pp. 385-390.








TECHNIQUES USED IN ANALYZING THE 
LEARNING ACHIEVEMENT OF NAVAL 
AVIATION CADETS* 


ELLIS WEITZMAN, LIEUT., U.S.N.R. 
AND 
WALTER J. McNAMARA, LIEUT., U.S.N.R. 


The general functions of the Central Examining Board of the 
Training Division, Office of the Chief of Naval Operations, 
have been outlined elsewhere.¹ Some of the specific needs
confronted by the Board, and how they have been dealt with,
are, perhaps, worthy of more detailed discussion. There is,
for example, the necessity for analyzing and reporting upon the 
findings with respect to which specific areas of the subject-matter 
have been mastered adequately by the cadets, and just which 
portions of the various courses are in need of greater emphasis. 

For reasons of practical expediency, the examinations mailed 
out to the many training centers by the Central Examining 
Board are accompanied by punched scoring stencils which 
enable the instructors to hand-score the answer sheets immediately.
This is necessary because it must often be known within
a matter of hours which cadets are prepared to proceed
to the next phase of training.

The Navy scoring system is used throughout the program. 
This system consists of a perfect score of 4.0, with 2.5 (or 62.5 per 
cent) required for passing. Therefore, with few exceptions, 
the weekly examinations consist of forty multiple-choice items. 
This simplifies the hand-scoring by instructors (and lessens 
the likelihood of scoring error), since each question counts 0.1 
on the Navy scoring basis. No conversion tables or calculations 
are required. Immediately after local scoring has been com- 
pleted, all answer sheets are mailed to the Central Examining 
Board for re-check and analysis. 
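
The arithmetic of the Navy scoring basis described above can be put in a few lines. This is a minimal sketch for a forty-item examination. The function names are illustrative, and it is assumed that a score of exactly 2.5 counts as passing.

```python
ITEMS_PER_TEST = 40        # forty multiple-choice items per weekly examination
PASSING_CORRECT = 25       # 25 items correct corresponds to 2.5 on the Navy scale


def navy_score(n_correct: int) -> float:
    """Each correct answer counts 0.1, so forty correct answers give 4.0."""
    if not 0 <= n_correct <= ITEMS_PER_TEST:
        raise ValueError("number correct must lie between 0 and 40")
    return n_correct / 10


def passes(n_correct: int) -> bool:
    """Assumption: a score of exactly 2.5 (62.5 per cent) is a passing score."""
    return n_correct >= PASSING_CORRECT


for n_correct in (40, 25, 24):
    print(n_correct, navy_score(n_correct), passes(n_correct))
```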

I.B.M. test-scoring machines are used at the Board. Any 
discrepancies found to exist between local scores and the Board’s 





* The opinions or assertions contained herein are the private ones of the 
writers and are not to be construed as official or reflecting the views of the 
Navy Department or the naval service at large. 

1 E. Weitzman and R. C. Bedell. ‘The Central Examining Board for
the Training of Naval Air Cadets.” Psychol. Bull., 1944, 41, 57-59. 



machine scores are hand-checked. The schools are then notified 
of any changes they must make in their records. This is always 
done, of course, when the magnitude of a scoring error is sufficient 
to make an important difference in the cadet’s standing. Any
recurrence of minor scoring errors is also pointed out in order to 
insure more care in the subsequent handling of answer sheets. 

Distributions of scores for each individual school or station 
are then prepared, as well as a total distribution for all the schools 
in a particular phase of the training program. The mean, the 
interquartile range, and the total range are presented graphically 
for each school and for the total population. A table is also 
prepared giving for each school the mean score, the percentage 
passing, and the percentage failing, i.e., percentages over and 
under 2.5. 

The next step in the test analysis is the preparation of an 
item-analysis for each test question. A random sampling of 
test papers is made, using the following technique: Knowing 
the mean for the population, four schools are selected, these 
four having mean scores nearest to the mean of the general 
population. Fifty answer sheets are randomly selected from 
each of the four schools chosen. The answer sheets from the 
four schools are then placed in two groups of one hundred each. 
By means of the item-analysis unit of the test-scoring machine, a 
graphic tabulation for each group of one hundred papers is 
printed by the machine which shows the number of individuals 
selecting each of the responses to each question. Using one 
hundred papers has the advantage of giving percentage figures 
directly, thus eliminating the need to calculate them. The 
use of two groups of one hundred each results in a ready measure 
of reliability. 
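
The sampling and tabulation procedure just described was carried out on a test-scoring machine; the following sketch merely restates it in code for clarity. Everything in it is hypothetical: the school names and means, the simulated answer sheets, the assumption of five choices per item, and the assumption that the two hundred pooled sheets are shuffled before being divided into two groups.

```python
import random


def pick_schools_near_mean(school_means, population_mean, how_many=4):
    """Names of the schools whose mean scores lie nearest the population mean."""
    ranked = sorted(school_means, key=lambda s: abs(school_means[s] - population_mean))
    return ranked[:how_many]


def draw_two_groups(sheets_by_school, schools, per_school=50, seed=0):
    """Draw per_school sheets at random from each school, pool, and split in two.

    Assumption: the pooled sheets are shuffled before being halved."""
    rng = random.Random(seed)
    pooled = []
    for school in schools:
        pooled.extend(rng.sample(sheets_by_school[school], per_school))
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled[:half], pooled[half:]


def tabulate(group, n_items, choices="ABCDE"):
    """Count the papers in a group that selected each response to each item."""
    table = [{choice: 0 for choice in choices} for _ in range(n_items)]
    for sheet in group:
        for item, answer in enumerate(sheet):
            table[item][answer] += 1
    return table


# Hypothetical data: six schools, sixty simulated forty-item answer sheets each.
school_means = {"A": 2.4, "B": 3.1, "C": 2.9, "D": 3.4, "E": 3.0, "F": 2.8}
population_mean = 2.95
data_rng = random.Random(1)
sheets_by_school = {
    name: ["".join(data_rng.choice("ABCDE") for _ in range(40)) for _ in range(60)]
    for name in school_means
}

chosen = pick_schools_near_mean(school_means, population_mean)
group_1, group_2 = draw_two_groups(sheets_by_school, chosen)
table_1, table_2 = tabulate(group_1, 40), tabulate(group_2, 40)

# With one hundred papers per group, each count is already a percentage.
print("schools chosen:", chosen)
print("item 1, group 1:", table_1[0])
print("item 1, group 2:", table_2[0])
```

Comparing the two tabulations item by item gives the rough check on reliability mentioned in the text.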

As an indication of the statistical validity of the test items, 
the test papers of the highest-scoring twenty-five per cent and 
of the lowest scoring twenty-five per cent from each of the four 
schools selected are pooled. A tabulation is made by means 
of the item-analysis unit of the scoring machine which gives 
the percentage of correct responses on each question for each of 
these two extreme groups. The difference between the per- 
centages of correct responses of the high and low groups provides 
a readily-derived and exceedingly useful index of the discriminat- 
ing value of the questions. 
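
The index of discriminating value described above is the difference between two percentages and can be sketched as follows. The data are hypothetical (two schools of eight papers each, two items), and the function names are illustrative. The original work was done with the item-analysis unit of the scoring machine, not with code.

```python
def extreme_groups(papers_by_school, fraction=0.25):
    """Pool the highest-scoring and lowest-scoring fraction of papers from each school."""
    high, low = [], []
    for papers in papers_by_school.values():
        ranked = sorted(papers, key=lambda paper: paper["score"], reverse=True)
        cut = max(1, int(len(ranked) * fraction))
        high.extend(ranked[:cut])
        low.extend(ranked[-cut:])
    return high, low


def percent_correct(group, item):
    """Percentage of papers in the group answering the given item correctly."""
    return 100.0 * sum(paper["correct"][item] for paper in group) / len(group)


def discrimination_index(papers_by_school, item):
    """Per cent correct in the high group minus per cent correct in the low group."""
    high, low = extreme_groups(papers_by_school)
    return percent_correct(high, item) - percent_correct(low, item)


def make_paper(score):
    # Hypothetical pattern: item 0 is passed only by cadets scoring 3.0 or more;
    # item 1 is passed by every cadet.
    return {"score": score, "correct": (score >= 3.0, True)}


papers_by_school = {
    "X": [make_paper(s) for s in (3.8, 3.5, 3.2, 3.0, 2.8, 2.6, 2.3, 2.0)],
    "Y": [make_paper(s) for s in (3.9, 3.4, 3.1, 2.9, 2.7, 2.5, 2.2, 1.9)],
}

print("item 0:", discrimination_index(papers_by_school, 0))
print("item 1:", discrimination_index(papers_by_school, 1))
```

An item passed by everyone, like the second hypothetical item, shows no difference between the extreme groups and therefore has no discriminating value, however sound its content.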










The above procedure is employed in lieu of a more statistically 
refined technique because of the large volume of material to be 
analyzed, and because it permits using the clerical staff ordinarily 
available. This technique, moreover, has been found to be 
statistically adequate for the purposes involved. 

In order that the professional staff may make convenient use of 
test-analysis findings in issuing reports and revising examinations, 
these data are posted on a specially-printed 5-by-8 card which 
also has the item concerned pasted on the card. By referring 
to this card alone, the test constructor may note the difficulty 
of the question, its discriminating power, and precisely which 
wrong concepts are most prevalent among the cadets. It is
also possible to determine from the cards which incorrect choices
were seldom selected and are consequently in need of revision.

If it is found that a certain choice is being selected by an 
extremely small percentage of cadets, it is either revised or 
replaced by a totally new choice. If the item itself is weak 
in that it shows a small percentage of discrimination between 
high and low scores, it is either altered or eliminated from further 
use. The only exception to this is in the case of an item which 
has great instructional value because it emphasizes some impor- 
tant and fundamental concept. Poorly-discriminating items 
are occasionally retained because of their instructional value. 

A further value of the analysis data is that they permit the 
test constructor to control the general difficulty level of the 
examinations. A test may thus be increased or decreased in 
difficulty by working from a known point of reference. Proper 
selection and revision of analyzed test items make possible a 
considerable degree of accuracy in regulating the percentage of 
cadets failing the examinations. Since preliminary tryout of 
complete tests has not been feasible, this procedure has been 
indispensable in the refinement of revised and new tests issued 
by the Board. 

In the matter of reporting the test results to the administra- 
tive command, and to the schools involved, speed is of the 
essence in order for the analyses to be of maximum usefulness 
in training the succeeding battalions of cadets who enter in a 
continuous flow. This is mentioned because it is an essential 
part of the framework into which the analysis and reporting 
methods must fit. 







The test analysis reports consist of two sections, which may be 
designated as ‘qualitative’ and ‘quantitative.’ The qualitative 
portion of the reports presents a discursive summary of the 
ideas and principles in the light of how well they have been 
mastered by cadets. Emphasis in the report is placed upon 
those principles which have been found least well learned by 
cadets. This is in order that these deficiencies may be quickly 
remedied. The report indicates not only which principles have
not been adequately grasped but also the specific nature of the errors
and misconceptions. The qualitative section of
the report also points out those areas of the subject-matter 
which have been extremely well learned. (In all cases, the 
percentage of cadets mastering each principle is given.) This 
enables the instructor to judge from which portions of the course 
the additional allotment of instructional time and emphasis 
may best be drawn. 

The reports do not refer directly to particular test questions; 
instead, they refer to the facts or principles covered by the 
questions. The reason for this practice is that it is not intended 
that instructors teach specific answers to specific questions. 
The aim is to have the instructor bring about a ‘comprehension’ 
of the principles involved rather than mere memorization. 
Stating the findings in terms of particular test items would be 
of no value to instructors in any event, since alternate examina- 
tion forms (covering the same ideas in other ways) are being 
issued constantly. 

The quantitative section of the report consists of tabular 
material and graphs presenting the mean score for the total 
population and the percentage of the total number of cadets 
passing the examination (i.e., attaining a score of 2.5 or better). The
same information is also presented for each individual school 
in the study. This table permits each school to determine its 
standing relative to all others in the same category. Schools 
falling far below the mean are thus stimulated to renewed efforts. 
They are also enabled to judge the extent of their success against 
a background of national norms. 

From the above, it is evident that the tests serve a double 
purpose in that they not only assess the learning achievement of 
cadets, but provide diagnostic data of great value in determining 
learning difficulties of cadets in general. This does not imply 










that the tests are a diagnostic instrument with respect to the 
individual cadet. The manner in which the tests are con- 
structed and the techniques used in preparing the reports are, 
however, such that the tests also have diagnostic and remedial 
utility. 

This reporting method has been found quite worth while in 
all stages of a varied training program which ranges from rela- 
tively academic ground school phases to later stages in which 
primary emphasis in training cadets is upon actual flight per- 
formance. The subjects in which this analyzing and reporting 
technique has been used advantageously include Mathematics, 
Physics, Theory of Flight, Civil Air Regulations, Aerology, 
First Aid, Seamanship, Aircraft Engines, Flight Maneuvers, 
and others. The curriculum covered varies all the way from 
orthodox classroom instruction in the more academic subjects 
to ‘on the line’ instruction in Aircraft Engines, daily weather 
observations in Aerology, and flight training in the air. Despite 
the range of training, it is felt that the above techniques have 
served quite well. They are, further, a good illustration of the 
manner in which traditional testing procedures may be adapted 
to meet wartime requirements. 








BOOK REVIEWS 


EDWARD K. STRONG, JR. Vocational Interests of Men and
Women, Stanford University, California: Stanford University
Press, 1943, pp. xxix + 746.


In this single volume Strong has summarized some twenty
years of research, both his own and that of his students, on
the measurement of interests. A great deal of information not 
previously published is presented and all of the data are eventu- 
ally brought to bear upon the practical problem of vocational 
choice. Part One is a general introduction in which the author 
states his views about the nature of interests, reviews briefly 
the conclusions to the entire volume, describes in detail his 
vocational interest blanks, and emphasizes the neglected fact 
that people in the same culture have a great many interests in 
common. 

Part Two summarizes the evidence for the possibility of 
differentiating men in terms of expressed occupational interests. 
Three non-occupational scales are described in Part Three,
and these scales are used to contrast the interests of upper and
lower socio-economic groups of men and women and of fifteen 
and twenty-five year old males. The final chapter in this 
third section identifies the various factors that underlie interest. 
Part Four describes the use of interest tests in guidance. Part 
Five indicates what has been done to differentiate the inferior 
from the superior both occupationally and scholastically. In 
Part Six Strong suggests a new procedure which may be ade- 
quate to study the interests of persons in the skilled trades. 
The concluding section describes what has to be done to con- 
struct and score an interest inventory. 

Strong’s entire argument bears upon two different problems 
involved in interest measurement: first, the differentiation of 
occupational groups; and, second, assigning individuals to one 
or more groups depending upon their interest scores. The 
author believes that his interest scale does both of these tasks 
reasonably satisfactorily. The reliability of the instrument, 
of course, has been well established, and the coefficients reported 
usually fall between .85 and .90. Even a reliability of this
magnitude, however, leaves much to be desired when the scale 
is used for individual prognosis. 



Many questions of fundamental importance in appraising the 
interest inventory were answered by correlation data. Some 
of these questions had to do with the permanence of interest, 
the prediction of occupational success based upon interest 
score, the similarity of groups, and the relations between var- 
ious measures of interest as well as between various occupational 
groups. The reviewer at times was dissatisfied with the inter- 
pretation of coefficients of a certain magnitude. For example, 
when the author described the relationship between scores on 
his interest inventory and measures of intelligence as being 
about the magnitude of .40 (p. 49) the contention was that this 
represented a slight degree of association. Later on it was 
contended that a correlation of .73 was usually considered a 
high coefficient. Any criticism of this sort is not so serious 
as it might be in view of the fact that the author always gave 
the raw correlation so that a careful reader could make his own 
interpretation. 

Strong attends at a number of points to the relation that exists 
between interests and abilities, although in the last analysis 
he is frank to state that crucial evidence is lacking (p. 512).
In his view, interest is an “indeterminate indicator” of success 
(p. 14) largely because interest signifies satisfaction but not 
necessarily superior achievement. In another place (p. 17) 
ability is likened to the motor in a boat and interest to the 
rudder. Of a large number of coefficients of correlation measur- 
ing the statistical relationship between interest and intelligence, 
which is one aspect of the problem of the relationship between 
interest and ability, eighty per cent of the coefficients fell between 
plus and minus .30 (p. 331 f.). Eventually Strong takes the 
common-sense view in this connection that a “dichotomy is 
unnecessary; true counseling must consider both interests and 
abilities.” (p. 411.) Interests and abilities must be considered 
simultaneously because there would be little value in considering 
occupational interests that would seem to be unsupported by any 
promise of talent (p. 412).

The reviewer has always been somewhat skeptical of the use of 
interest inventories under certain conditions because of the fact 
that it is easy for the subject to react to them in such a way as to 
mislead. Strong deals with this problem under the heading, 
“Faking Scores” and the evidence is clear that an adult can, if 
















he wishes, check those interests which will give him a high rating 
on a desired occupation (p. 686 f.). Strong’s attitude toward 
this problem is expressed by the contention that ‘‘under normal 
conditions there is little or no apparent temptation to fudge.” 
(sic) (p. 687). The argument behind this hope was not too con- 
vincing to the reviewer. It reminded him again of the great gap 
that exists between highly refined psychological tests like 
Strong’s questionnaire and the measuring instruments that are 
employed by the physical scientists. The latter would have 
little confidence in a scale that would let a man weigh what he 
wanted to weigh even if under most circumstances men would 
want to know the truth (?) about their avoir du pois. 

The student who wants to bring himself abreast of an important 
area which has not been summarized since Fryer’s Measurement
of Interests, 1931, will find Strong’s volume indispensable. Not 
only is it a 750-page storehouse of research data, but it also is full 
of hypotheses and questions which will be most provocative to 
research students in the field.

STEPHEN M. COREY.

University of Chicago. 


HELEN M. WALKER. Elementary Statistical Methods. New
York: Henry Holt and Co., 1943, pp. 368.


The material in this text is based upon the needs of students as
revealed by years of classroom teaching. The major emphasis
is upon acquainting the student with a variety of statistical
techniques, noting the limitations and relative advantages of each,
with the assumptions upon which it is based. The place each
technique occupies in a logical analysis of the data and the inter- 
pretations that can be derived from it are explained. Provision 
is made for the variation in mathematical background of students. 
For those with a broad training, mathematical notes are provided 
in the appendix. 

Many noteworthy features of the book should receive special 
mention. The exposition is simple and clear. Underlying con- 
cepts and interpretation receive special attention. Since the 
material is in harmony with recent developments in the mathe- 
matical theory of statistics, it will furnish a sound foundation 
for more advanced study. The treatment of symbolism is ade- 
quate and exceedingly helpful. Putting exercises at appropriate 
places in the text is an effective teaching method. Particularly
noteworthy are the discussions of the normal curve and of
sampling. Throughout, the organization is such that the text is
almost self-teaching.

Some instructors will be disappointed by apparent incomplete- 
ness along certain lines: (1) The treatment of reliability of differ- 
ences and of test scores is meagre, (2) Test and item validity as 
well as multiple and partial correlation are not discussed or are 
barely mentioned. Presumably many of these items will be 
covered in the author’s forthcoming Book II which will be 
devoted to further development of statistical inference. 

This excellent book will find wide use. Many instructors will 
welcome it as just the text they have been waiting for. 

MILES A. TINKER.


University of Minnesota. 


J. T. MacCurdy. The Structure of Morale. New York: The
Macmillan Company, 1943, pp. 219. $2.00 


When future historians come to the question of credit for the 
directions taken by World War II they will certainly consider the 
contribution of another man behind the man behind the gun: 
the psychologist. Much has been said of the use of psychology by 
the Germans to turn the war in their favor. More is to be heard 
of the psychologists of the United Nations. Historians will 
especially appreciate the explanation of British army practice to 
be found in The Structure of Morale, a series of lectures delivered 
by J. T. MacCurdy to successive classes of British army officers, 
in whose hands rests the responsibility for morale among men 
who must fight this war. 

MacCurdy’s lectures are an apparently successful excursion 
into the realm of the plausible conjecture. Drawing upon the 
known reactions of lower animals under conditions of intense 
fear and strain, this British psychologist has predicted the effects 
of war’s primitive environment upon civilized man. Although, 
as he himself is careful to point out, analogy is always a danger- 
ous business, the experiences of World War II so far seem to bear 
out his conclusions. His errors, if any, will doubtless become as 
evident. 

The first of the three parts in which the lectures appear is 
particularly interesting to the armchair strategist who has 
followed the mass bombing of Berlin and other key cities of the
Reich and German-occupied Europe. Here, in a discussion of
passive adaptation to danger, MacCurdy shows the error of 
Germany’s token bombing of English cities in the fact that 
immediate eye-witnesses of the destruction, the most susceptible 
to panic, were few, whereas the many remote from the targets 
of a token raid were only fortified by their escape and angered to 
greater unity by the fact of destruction. The Berlin bombings 
are clearly designed to bring ruin close to the door of every home, 
and, hence, fear and incapacity for effective action to its occu- 
pants. If the raids do not achieve this objective, the result will 
appear to be not so much a denial of MacCurdy’s thesis as it will 
a tribute to the backlog of German morale and the greater fear 
of the Gestapo and a future in a world of Allied victory. 

Regarding active adaptation to danger, the author discusses 
the importance of successful past experience to such tasks as 
unexploded-bomb disposal, the advantage of the aggressive 
posture and expression in bayonet action, and the psychological 
need for a perfected, automatic response in the face of danger. 
He points out that drill, for instance, the bane of the new recruit’s 
existence, establishes an habitual response to orders which, in an 
emergency, provides the otherwise baffled soldier with a feeling 
of security. It is noted that the German soldier’s dependence 
upon the hand-grenade for his personal security has demonstrated 
its obvious weakness in the ready surrender of individuals whose 
grenade supplies have become exhausted. 

The discussion of morale in Part II of the book presents man
as an animal whose responses and values vary with the society in 
which he happens to be; further, as part of a national group whose 
morale is largely determined by the history of that group. 
Ignorance of this second point is credited with Germany’s mis- 
calculation of Russian endurance; and Japan’s, of Chinese. 
Both the Russians and the Chinese were accustomed to genera- 
tions of physical discomfort, malnutrition, and hardship; these 
had been survived; hence more of the same through an external 
enemy was not a matter for overwhelming fear but for stubborn 
resistance-as-usual. MacCurdy also shows the fallacy in British 
thought in underestimating the importance of force, which the 
Germans revere, and exposing the island to the very near reality 
of unopposed invasion. 

Interesting to the eyes of an American reader is the portion 
of the book dealing with the British imperial problems and the
conclusion that Britain should continue paternalism until she has
educated the ‘backward peoples’ to her ideals of majority 
responsibility and minority rights. The author’s judgment 
seems to be that such evolution in the social values of African 
and Asiatic peoples will be a matter of generations. Obviously, 
the publication of the date of liberation, which certain prominent 
Americans have so ardently espoused, would be depressing and 
scarcely beneficial to the “structure of morale.”

In Part III, a treatment of problems of organization styles the 
distinction between dictatorship and democracy as a matter of 
degree of freedom rather than of essentially different organization, 
and avers the need for liaison officers in large organizations, the 
tendency for employees (especially those recruited from the ‘lower 
classes’) to hold the department or their careers above the good 
of the whole concern, the desirability of a somewhat flexible but 
largely permanent, specially trained ruling class, and the need 
for better opportunities in scientific research in the public service. 
A dictatorship is defined as a machine so efficiently geared to a 
practical purpose that pure science without immediate practical 
application is ruled out. In this definition the author sees the 
downfall of Germany, for in a long war against nations which are 
constantly supplied with new resources through research, the 
dictator nation must eventually be outstripped. His argument 
that a hierarchical organization is incompatible with evolution 
is not convincing in that it seems to carry too far the comparison 
of a dictatorship with a machine assembly plant and to minimize 
man’s adaptability to changes in activity and organization. 

The author in this third section of the book gives the impression 
of having changed from the objective, universal psychologist into 
an Englishman seeking earnestly to find psychological justifica- 
tion for the perpetuation of his society and his institutions in their 
present form. This transformation will prove disappointing to 
the reader who has in the first two parts enjoyed the author’s 
courageous application of analogy and logic to the problems of 
fear and morale in total war and who would like to believe the 
entire book a work approaching that which is fondly called ‘the 
truth.’ The absence of footnotes, acknowledgement of quota- 
tions, and bibliography does not increase confidence. 

CONSTANCE M. McCULLOUGH.


Western Reserve University. 






ALFRED STAFFORD CLAYTON. Emergent Mind and Education.
New York: Bureau of Publications, Teachers College,
Columbia University, 1943, pp. 179.


This monograph is an example of exegesis in the field of the 
social studies. Mr. Clayton read many of George Herbert 
Mead’s books, essays and articles and abstracted from them all 
arguments and discussions that dealt with the emergence of 
reflective intelligence from what Mead insisted were prior non- 
mental conditions. All of these arguments and discussions were 
then related to some of the central problems of education. 

Those who are interested in personality development and in the 
social implications of the educative process will be stimulated
by many of Mead’s assertions. He believed that constitutional 
democracy constituted the best guarantee of a good state, and 
that the scientific method should be employed in working on the 
problems inherent in any dynamic social order. Responsibility 
in Mead’s view is a concomitant of freedom and the citizen should 
grow in responsibility as he exercises more and more freedom. 

STEPHEN M. COREY.


University of Chicago. 











